I have been thinking about the performance of Wikipedia, and how it might be improved.
Before I go off and investigate in detail, I'd just like to check my basic understanding of how the code works (based on reading this list -- I haven't pulled the code down from CVS to look at it yet).
=== Total guesswork follows ===
Am I right in thinking that, for each ordinary page request (roughly as in the sketch below):

* the raw text is pulled out of the database
* the text is parsed and reformatted
* links are looked up to see whether their target pages exist, and are treated appropriately
* the final page is generated as HTML, with page decorations added as per the theme
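Purely to check that mental model, here is the pipeline I'm imagining as a toy Python sketch, with an in-memory dict standing in for the database. Every name in it is invented by me for illustration; it's not based on the actual code:

    import re

    # Toy stand-in for the page table in the database.
    PAGES = {
        "Main Page": "Welcome to [[Wikipedia]]!",
        "Wikipedia": "An encyclopedia.",
    }

    def serve_page(title):
        # 1. pull the raw wikitext out of the "database"
        raw_text = PAGES[title]

        # 2./3. parse the text; for each [[link]], check whether the target
        #       page exists so it can be styled as existing or missing
        def link_html(match):
            target = match.group(1)
            cls = "existing" if target in PAGES else "new"
            return '<a class="%s" href="/wiki/%s">%s</a>' % (cls, target, target)

        body = re.sub(r"\[\[(.+?)\]\]", link_html, raw_text)

        # 4. final HTML generation, with the page decorations wrapped around it
        return "<html><body><h1>%s</h1><p>%s</p></body></html>" % (title, body)

    print(serve_page("Main Page"))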
My general impression of the activity rate is:

* about 100 pages per day are created or deleted
* roughly one edit every 30 seconds
* roughly one page hit every second
Packet loss seems negligible, so you don't seem to be running out of bandwidth.
Although I guesstimate the hit rate at around one per second, pages seem to take around 5 seconds to serve, suggesting that the system is probably running at a load average of around 5.
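As a quick back-of-envelope check of that guess (assuming requests arrive at a steady rate, which is obviously a simplification):

    # Little's law: requests in flight = arrival rate x time per request
    arrival_rate = 1.0   # page hits per second (my guesstimate above)
    service_time = 5.0   # seconds to serve a page (roughly what I observe)
    print(arrival_rate * service_time)   # => 5.0, i.e. a load average around 5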
My best guess is that the parsing and lookups on regular pages are currently the main load, not editing or exotic database queries -- is this right?
Jimbo has mentioned that the machine has a lot of RAM, so disk I/O is unlikely to be the bottleneck: it's more likely to be CPU and inter-process locking problems.
If so, I think careful caching of rendered page content could greatly improve performance by reducing the number of parsings, renderings and link lookups, at the cost of a slight increase in the cost of page creation and deletion. By freeing up resources, all other operations should benefit as well.
If I'm right, I think suitably intelligent caching could be applied not only to ordinary pages, but also to some special pages, without any major redesign or excessive complexity.
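Given roughly one edit every 30 seconds against one page hit a second, reads outnumber writes by about 30 to 1, so such a cache should hit most of the time. To make the idea concrete, here's the sort of thing I have in mind -- a cache of rendered HTML keyed by page title, thrown away whenever the page (or a page linking to it) is edited, created or deleted. Again, this is just an illustration with made-up names, building on the toy serve_page() above, not a proposal for actual code:

    RENDER_CACHE = {}   # page title -> rendered HTML

    def serve_page_cached(title):
        # Only parse, render and look up links on a cache miss.
        if title not in RENDER_CACHE:
            RENDER_CACHE[title] = serve_page(title)
        return RENDER_CACHE[title]

    def invalidate(title, titles_linking_here=()):
        # The extra cost paid on edit/create/delete: drop the cached copy of
        # this page, and of pages linking to it, so link colouring stays correct.
        RENDER_CACHE.pop(title, None)
        for t in titles_linking_here:
            RENDER_CACHE.pop(t, None)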
Before I start to look at things in more detail, could anyone confirm whether I am even vaguely making sense?
-- Neil