Hello,
I am one of the co-authors of the IEEE Scale paper.
Nice work! My hats are off :)
According to the wikipedia statistics 95% of the request are handled by the squids. And scaling out the squids is not a hard problem. That is why we only looked at the render farm and the databases.
Thats for anonymous users. Logged in users still hit the cluster.
Actually, 0.2% requests hitting the backend are saves (surprise!!!!) There's lots of other stuff done, like previews, searches, watchlists, actual browsing, various meta stuff, etc.
As I understand the MySQL setup, you are running on a replicated MySQL database. Read requests can be answered by any replica and write requests go to all replicas. -> Adding more nodes does not increase your write capacity.
Thats exactly right. We definitely have scaling bottleneck there. Our easiest way to scale writes is splitting off languages, and it will take quite some time to hit troubles on any non-English language (and English is doing fine at the moment too, with relatively low-grade database hardware). Actually we used to hit scaling bottleneck once upon a time, back when we were saving revision texts into core database (and actually main tables, in pre-1.5 times). It was a single-day (or 15-minute, to add some dramatic effect) hack to move that stuff out of core databases.
In our setup the replication degree is fixed. Every item is stored k times, no matter how many nodes you are using. So the write capacity is increasing with the number of database nodes.
Thats indeed nice for anything what requires simple key-value storage (and we definitely have such components, such as revision text storage, or various other simple metadata).
When you update a page you have to update several of these maps. But that is what the transactions are for.
Well, in this case, we end up with plentiful of maps:
Pages by unique ID, pages by name and title, pages by random value (ha!), pages by length (just in cases). Every page then has multiple revisions, which are saved by page, by id, by timestamp, by timestamp-per-page, by timestamp-per-user, and by timestamp-per-usertext (for anons, mostly). Every page then links to other pages, and is linked from other pages. Every page then is embedded as a template somewhere else, or embeds templates. Every page is in a category, or is an actual category. Every page has broken links, that are tracked too. Every page has images that have to be tracked. Every page has external links Every page has ...
And the biggest problem is, that for every of these maps, there're range scans or multiple reference reads. This leads to reading from 100 nodes (unless lots of clever data clustering is employed) for every read done. Of course, complexity of writes, when you don't go after infinite scaling, is much bigger too.
We looked at several scenarios here. You could run a p2p system within your data center because it scales better, is easier to maintain, etc. You could run one p2p overlay over several datacenters (here: Florida, Amsterdam, South Korea). Then you have to take care of data placement and network partitioning. Or you could run the p2p overlay over the users' pcs. But then you run into trust issues.
Nah, user's PCs are out of question. We'd really want to see nice scalable stores for some of our data, which works great with key-value stores. As well, we can probably offload some of biggest maps to somewhere 'out there'.
Another problem is that servers die in batches. It is much easier to put one database server on one power feed and another database server on another, or place them in separate datacenters. Once you have 1000 nodes that have to have HA characteristics as a whole, the lack of understanding which data goes where (think, availability zones), can lead to data lost, either temporary or permanently.
Of course, it is matter of engineering, but many of such problems are not resolved at research state.