Hello,
I am one of the co-authors of the IEEE Scale paper.
Nice work! My hats are off :)
According to the Wikipedia statistics, 95% of the requests are handled by the
squids. And scaling out the squids is not a hard problem. That is why we only
looked at the render farm and the databases.
That's for anonymous users. Logged-in users still hit the cluster.
Actually, 0.2% of requests hitting the backend are saves (surprise!!!!)
There's lots of other stuff done, like previews, searches, watchlists,
actual browsing, various meta stuff, etc.
As I understand the MySQL setup, you are running on a replicated MySQL
database. Read requests can be answered by any replica and write requests go
to all replicas. -> Adding more nodes does not increase your write capacity.
That's exactly right. We definitely have a scaling bottleneck there. Our
easiest way to scale writes is splitting off languages, and it will
take quite some time to hit troubles on any non-English language (and
English is doing fine at the moment too, with relatively low-grade
database hardware).
Actually, we did hit a scaling bottleneck once upon a time, back when
we were saving revision texts into the core database (and actually into
the main tables, in pre-1.5 times). It was a single-day (or 15-minute, to
add some dramatic effect) hack to move that stuff out of the core databases.
In our setup the replication degree is fixed. Every item is stored k times, no
matter how many nodes you are using. So the write capacity is increasing with
the number of database nodes.
That's indeed nice for anything that requires simple key-value storage
(and we definitely have such components, such as revision text
storage, or various other simple metadata).
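To make the difference between the two setups concrete, here is a rough back-of-the-envelope sketch. It is an idealised model only: it assumes writes spread evenly across nodes and ignores replication overhead, and the per-node figure of 1000 writes/s and k = 4 are made-up numbers, not measurements from either system.

```python
def write_capacity_full_replication(nodes: int, per_node_writes: float) -> float:
    """Every write is applied on every replica, so the cluster as a whole
    can accept no more writes than a single node can."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return per_node_writes


def write_capacity_fixed_degree(nodes: int, per_node_writes: float, k: int) -> float:
    """Every item lives on exactly k nodes, so one logical write costs
    k node-level writes and the total budget is shared accordingly."""
    if k < 1 or nodes < k:
        raise ValueError("need 1 <= k <= nodes")
    return nodes * per_node_writes / k


# Hypothetical figures, for illustration only.
PER_NODE = 1000.0  # writes per second one node can absorb
K = 4              # fixed replication degree

for n in (4, 16, 64):
    full = write_capacity_full_replication(n, PER_NODE)
    fixed = write_capacity_fixed_degree(n, PER_NODE, K)
    print(f"{n:3d} nodes: full replication {full:8.0f}/s, fixed degree {fixed:8.0f}/s")
```

With n = k the two models coincide; the fixed-degree setup only pulls ahead as nodes are added beyond k.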
When you update a page you have to update several of these maps. But that is
what the transactions are for.
Well, in this case, we end up with plenty of maps:
Pages by unique ID, pages by name and title, pages by random value
(ha!), pages by length (just in case).
Every page then has multiple revisions, which are saved by page, by
id, by timestamp, by timestamp-per-page, by timestamp-per-user, and by
timestamp-per-usertext (for anons, mostly).
Every page then links to other pages, and is linked from other pages.
Every page then is embedded as a template somewhere else, or embeds
templates.
Every page is in a category, or is an actual category.
Every page has broken links that are tracked too.
Every page has images that have to be tracked.
Every page has external links.
Every page has ...
And the biggest problem is that for every one of these maps there are
range scans or multiple reference reads. This leads to reading from
100 nodes (unless lots of clever data clustering is employed) for
every read done.
Of course, complexity of writes, when you don't go after infinite
scaling, is much bigger too.
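The range scan point can be illustrated with a small sketch. It compares how many nodes a scan over 500 consecutive keys touches when keys are placed by hash versus by contiguous key range (one simple form of data clustering). The node count, key space and both placement functions are assumptions made up for the example, not a description of any particular store.

```python
import hashlib


def hash_node(key: int, nodes: int) -> int:
    """Place a key on a node by hashing it (typical for key-value stores)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % nodes


def range_node(key: int, nodes: int, key_space: int) -> int:
    """Place a key by contiguous range, so neighbouring keys share a node."""
    return min(key * nodes // key_space, nodes - 1)


def nodes_touched(keys, place) -> int:
    """Count the distinct nodes a read over the given keys has to contact."""
    return len({place(k) for k in keys})


NODES = 100            # hypothetical cluster size
KEY_SPACE = 1_000_000  # hypothetical key space, e.g. revision timestamps
scan = range(500_000, 500_500)  # 500 consecutive keys

hashed = nodes_touched(scan, lambda k: hash_node(k, NODES))
ranged = nodes_touched(scan, lambda k: range_node(k, NODES, KEY_SPACE))
print(f"hash placement:  {hashed} nodes contacted")
print(f"range placement: {ranged} nodes contacted")
```

Hash placement scatters neighbouring keys over nearly the whole cluster, while range placement keeps this scan on a single node, at the price of hot spots when many writes land in the same range.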
We looked at several scenarios here. You could run a p2p system within your
data center because it scales better, is easier to maintain, etc. You could
run one p2p overlay over several datacenters (here: Florida, Amsterdam, South
Korea). Then you have to take care of data placement and network
partitioning. Or you could run the p2p overlay over the users' pcs. But then
you run into trust issues.
Nah, users' PCs are out of the question. We'd really want to see nice
scalable stores for the part of our data that works great with key-value
stores.
As well, we can probably offload some of the biggest maps to somewhere
'out there'.
Another problem is that servers die in batches. It is much easier to
put one database server on one power feed and another database server
on another, or place them in separate datacenters.
Once you have 1000 nodes that have to have HA characteristics as a
whole, the lack of understanding of which data goes where (think
availability zones) can lead to data loss, either temporary or
permanent.
Of course, it is a matter of engineering, but many such problems are
not resolved at the research stage.
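A minimal sketch of the "servers die in batches" concern, assuming a made-up cluster of six database nodes on three power feeds and a replication degree of 2. The node and feed names, and both placement functions, are hypothetical; the point is only that placement which ignores failure domains can put every copy of an item behind the same feed.

```python
def place_zone_aware(item_ids, nodes_by_zone, k):
    """Store each item on k nodes, every one in a different zone."""
    zones = sorted(nodes_by_zone)
    if k > len(zones):
        raise ValueError("need at least k zones")
    placement = {}
    for i, item in enumerate(item_ids):
        chosen = []
        for j in range(k):
            members = nodes_by_zone[zones[(i + j) % len(zones)]]
            chosen.append(members[i % len(members)])
        placement[item] = chosen
    return placement


def place_naive(item_ids, nodes, k):
    """Store each item on k consecutive nodes, ignoring zones entirely."""
    return {
        item: [nodes[(i + j) % len(nodes)] for j in range(k)]
        for i, item in enumerate(item_ids)
    }


def lost_items(placement, failed_nodes):
    """Items whose replicas all sit on failed nodes."""
    failed = set(failed_nodes)
    return [item for item, nodes in placement.items() if set(nodes) <= failed]


# Hypothetical layout: two database nodes per power feed.
zones = {"feed-a": ["db1", "db2"], "feed-b": ["db3", "db4"], "feed-c": ["db5", "db6"]}
all_nodes = [n for z in sorted(zones) for n in zones[z]]
items = range(12)

zone_aware = place_zone_aware(items, zones, k=2)
naive = place_naive(items, all_nodes, k=2)

# Simulate one whole power feed going down.
print("zone-aware, lost:", lost_items(zone_aware, zones["feed-a"]))
print("naive, lost:     ", lost_items(naive, zones["feed-a"]))
```

In this toy layout the zone-aware placement survives the loss of any single feed, while the naive one loses every item whose two copies happened to land on db1 and db2.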
--
Domas Mituzas --
http://dammit.lt/ -- [[user:midom]]