With the recent increases in traffic (which fit well with my earlier estimates), I think we should be giving some thought now to how the Wikipedia can be made massively scalable. If we start now, we should have a solution available before it is needed, rather than hitting a crunch point and then fixing things in a hurry.
Some thoughts on this:
* The single image of the database is the ultimate bottleneck. It should be held in RAM as far as possible. Thought should be given to using PostgreSQL rather than MySQL, since PostgreSQL should scale better under heavy load.
* Wikipedia is read an order of magnitude more often than it is written. Caching can give a major performance boost, and will help mitigate underlying scaling problems.
* Wikipedia read accesses appear to obey [[Zipf's law]]: most pages are read relatively infrequently, and a few very frequently. However, most of the load comes from the aggregated accesses of the many low-traffic pages. Therefore, a 'hot spot' cache will not work: the cache has to be big enough to encompass the whole Wikipedia (see the quick numeric check after this list). Again, the cache will have to be held in RAM to avoid the 1000-times-slower performance of disk accesses.
* Web serving consumes RAM: a slow modem transfer of a page will lock down resources in RAM for the duration of a page load. At one end of the scale, this is simply socket buffers. At the other end, it can be a page-rendering thread and its stack, unable to progress to other work because it cannot commit its output to the lower levels of the web server. On low-traffic sites this is insignificant; on high-traffic sites it can become a bottleneck. Machines with vast amounts of RAM are very expensive: it is cheaper to buy a number of smaller machines, each with a reasonable amount of RAM -- and you get the benefit of extra CPUs for nothing.
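To put a rough number on the 'hot spot cache' point above, here is a quick numeric check. It is only a sketch: the article count, cache sizes, and Zipf exponent below are illustrative assumptions, not measured Wikipedia figures.
<pre>
# Under a Zipf-like popularity distribution, what fraction of total page
# views falls outside the top-N most popular pages?  All numbers here are
# illustrative assumptions, not measured Wikipedia statistics.
def zipf_tail_share(num_pages: int, cache_slots: int, exponent: float = 1.0) -> float:
    """Fraction of hits that miss a cache holding only the cache_slots hottest pages."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_pages + 1)]
    total = sum(weights)
    head = sum(weights[:cache_slots])
    return 1.0 - head / total

if __name__ == "__main__":
    for slots in (100, 1000):
        print(slots, round(zipf_tail_share(30000, slots), 2))
    # prints roughly: 100 -> 0.52, 1000 -> 0.31
</pre>
Even a cache of the 1000 hottest pages (out of an assumed 30,000 articles) still misses roughly a third of all hits, which is why the cache needs to cover essentially the whole Wikipedia.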
=== The grand vision ===
For all these reasons, I propose a solution in which the database is kept on a separate server from the user-facing servers. The extra overhead of doing this should be repaid many times over in the long run, because the front-end servers will take more than 90% of the load off the central server.
The front-end servers should be kept _dumb_, and all the control logic, program code, and configuration files should reside on the master server. This removes the need to configure vast numbers of machines in sync.
The front-end server code can also be kept small and tight, with most of the work (page rendering, parsing, etc.) being done by the existing code. (Later, rendering and parsing could be moved to the front-end servers if needed).
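As a loose illustration of the 'dumb front end' idea, each front-end box could pull everything it needs from the master at startup instead of carrying local configuration. The URL and JSON format below are hypothetical, invented purely for this sketch.
<pre>
# Minimal sketch: a front-end box fetches its configuration from the master
# at startup, so nothing needs to be configured on the box itself.
# The /frontend_config.json URL and its JSON format are assumptions.
import json
import urllib.request

MASTER = "http://www.wikipedia.com"

def fetch_config() -> dict:
    """Download this front end's configuration from the master server."""
    with urllib.request.urlopen(MASTER + "/frontend_config.json") as resp:
        return json.loads(resp.read().decode("utf-8"))
</pre>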
=== How to get there easily ===
The first step is to implement a separate caching server.
The system can even be trialled now, by running the master and slave servers on the same box! The existing Wikipedia just needs to have a special skin that serves up 'raw' rendered pages, without any page decorations, CSS, or whatever.
The caching server will deal with all user interaction: cookies, skins, etc. It should serve up 'raw' pages decorated with user-specific content and style sheets.
The caching system could be put on an experimental server, or later put on the main site, as a beta test.
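A rough sketch of how such a caching front end might behave, assuming the 'raw' skin and the last_changed lookup described below. The master URLs, the skin=raw parameter, and the one-line decoration step are illustrative assumptions rather than existing interfaces.
<pre>
# Sketch of the caching front end: raw pages are cached in RAM, validated
# against the master's last_changed timestamp, and decorated with the user's
# skin before being served.  The master URLs, the skin=raw parameter, and the
# trivial 'decoration' are illustrative assumptions.
import urllib.request
from urllib.parse import quote

MASTER = "http://www.wikipedia.com"
raw_cache: dict[str, tuple[str, str]] = {}   # title -> (timestamp, raw page HTML)

def last_changed(title: str) -> str:
    """Ask the master for the page's last-changed timestamp, e.g. 20020417120254."""
    with urllib.request.urlopen(f"{MASTER}/wiki/last_changed?{quote(title)}") as resp:
        return resp.read().decode("utf-8").strip()

def fetch_raw(title: str) -> str:
    """Fetch the undecorated page via the proposed 'raw' skin."""
    with urllib.request.urlopen(f"{MASTER}/wiki/{quote(title)}?skin=raw") as resp:
        return resp.read().decode("utf-8")

def serve(title: str, user_skin: str) -> str:
    """Return a decorated page, hitting the master only when the cached copy is stale."""
    stamp = last_changed(title)                  # one cheap timestamp hit per request
    cached = raw_cache.get(title)
    if cached is None or cached[0] != stamp:     # miss or stale: refetch the raw page
        raw_cache[title] = (stamp, fetch_raw(title))
    raw = raw_cache[title][1]
    # 'Decoration' here is just wrapping the raw body in the user's chosen skin.
    return f"<html><body class='{user_skin}'>{raw}</body></html>"
</pre>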
There is only one other thing needed to implement this. Each page in the central DB needs to have a 'last changed' time, and a way to get hold of it. Now, this is where things get slightly tricky. This 'last changed' timestamp should be updated, not only when the page itself is edited, but also when any page linked in the page is created or deleted. That is to say: when creating or deleting a page, timestamps on all pages that link to that page should be touched.
Note that there is no need to touch other pages' timestamps when a page is edited. This is good, as otherwise the extra overhead of all that marking would be unreasonably high.
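For illustration, the touch rule might look something like the following; the page and link table names are hypothetical, and SQLite stands in for the real database.
<pre>
# Sketch of the 'touch' rule: creating or deleting a page bumps the
# last_changed timestamp of every page that links to it (their link
# rendering changes), while an ordinary edit touches only the page itself.
# Table and column names are hypothetical; SQLite stands in for the real DB.
import sqlite3
import time

def touch_linking_pages(db: sqlite3.Connection, target_title: str) -> None:
    """Bump last_changed on all pages whose text links to target_title."""
    stamp = time.strftime("%Y%m%d%H%M%S")        # e.g. 20020417120254
    db.execute(
        "UPDATE page SET last_changed = ? WHERE title IN "
        "(SELECT from_title FROM link WHERE to_title = ?)",
        (stamp, target_title),
    )
    db.commit()

def on_page_created_or_deleted(db: sqlite3.Connection, title: str) -> None:
    # Only creation and deletion change how links to this page render,
    # so only those events need to touch the linking pages.
    touch_linking_pages(db, title)
</pre>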
Now, this last-modified timestamp can be determined using a special request format: the underlying HTTP mechanism (Last-Modified / If-Modified-Since headers) could be used, but that is too awkward to fiddle with while the software is still experimental. Better to do something like:
http://www.wikipedia.com/wiki/last_changed?Example_article
(which need not even invoke the main wiki code). The result is a page whose entire content is the timestamp:
20020417120254
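To make the shape of this concrete, here is a minimal sketch of the responder; the in-memory dictionary stands in for what would really be a single indexed lookup against the timestamp table, and none of this is existing Wikipedia code.
<pre>
# Minimal standalone handler for /wiki/last_changed?Title, kept separate so
# the main wiki code is never invoked.  The dictionary is a stand-in for a
# single indexed SELECT against the timestamp table.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, unquote

timestamps = {"Example_article": "20020417120254"}   # stand-in for the real table

class LastChangedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/wiki/last_changed":
            self.send_error(404)
            return
        stamp = timestamps.get(unquote(parsed.query))   # query is the article title
        if stamp is None:
            self.send_error(404)
            return
        body = stamp.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), LastChangedHandler).serve_forever()
</pre>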
=== Criticisms ===
But, I hear you say -- surely you have added another bottleneck? For every page hit, you still generate a hit on the central DB for the 'last changed' timestamp. The answer is: yes, but -- the timestamp db is very small, performs a very simple lightweight operation, has a single writer and multiple readers, and can therefore later be made separate from the main DB, replicated, etc.
So, those are some ideas:
* Client/server database split
* Front-end/back-end script split
Does anyone have any idea how reasonable this is, based on refactoring and retrofitting the existing script? I think that if things were done slowly, the front-end code could end up as about 10% of the overall complexity, mostly refactored from existing code.
-- Neil