Benjamin Lees wrote:
The slides for the talk are on the OnScale site <http://www.onscale.de/Reinefeld_Erlang_Exchange.pdf>, although I don't see an actual comparison in performance between the distributed architecture and the current Wikipedia setup.
He seems to ignore not only Squid, but also the key-value store MediaWiki is already well-integrated with: memcached. I think he's talking about something more complex (I only understand parts of it), but I don't think Wikipedia is much of a big dumb behemoth as far as architecture goes; I've always thought of it as the opposite, the lean model of incredible performance on an incredibly small budget.
Anyway, he also seems to be assuming that the scalability bottleneck is all in the 2000/s write requests, rather than the 48000/s read requests. Is this actually the case? On the server roles page <https://wikitech.leuksman.com/view/Server_roles> I see 10 database servers and hundreds of Apaches/Squids, so I'm dubious.
I think this focus is the point. He ignores the caches because he is most interested in database performance: what happens during and after a write, and how to scale that. All the Squids and memcached should work with their architecture as well.
From a pragmatic perspective, lots of other stuff is missing. For example, they exclusively use a DHT (key-value pairs) for access, not full-blown SQL (though presumably this could be added).
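To make the access model concrete, here is a minimal Python sketch of what key-value-only access looks like. It is not taken from the talk or from their code: the TinyDHT class, the node names, and the "page:" key scheme are all made up for illustration, and a real DHT would add replication, routing between machines, and transactions on top.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (SHA-1 digest as an integer)."""
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class TinyDHT:
    """Toy in-process DHT (hypothetical): keys are assigned to nodes
    by consistent hashing, and the only operations are put and get."""

    def __init__(self, node_names):
        # Sorted (ring position, node name) pairs.
        self._ring = sorted((_hash(name), name) for name in node_names)
        self._positions = [pos for pos, _ in self._ring]
        # One plain dict per node stands in for that node's storage.
        self._stores = {name: {} for name in node_names}

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

    def put(self, key: str, value) -> None:
        self._stores[self.node_for(key)][key] = value

    def get(self, key: str, default=None):
        return self._stores[self.node_for(key)].get(key, default)


# Usage: a page is fetched by its exact key; there is no query language.
dht = TinyDHT(["node-a", "node-b", "node-c"])
dht.put("page:Main_Page", "Welcome to the wiki")
text = dht.get("page:Main_Page")
```

Anything SQL would give you for free (say, "all pages in a category") has to be built by hand as additional keys, which is the part I meant by "missing".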
This is a research project, but if their numbers are right, they are an order of magnitude faster and leaner. Organizational and legal implications aside, a p2p architecture like the Internet itself is really what you would want for a next generation MediaWiki.
Now, all I actually wanted to know was how complete plog4u is for rendering MediaWiki syntax. I guess I shouldn't let my thoughts wander so much.
Dirk
PS: Would wikitech-l have been a better list to ask this question?