I have been thinking about the performance of Wikipedia, and how it might be improved.
Before I go off and investigate in detail, I'd just like to check my basic understanding of how the code works (based on reading this list -- I haven't pulled down the CVS to look at it yet).
=== Total guesswork follows ===
Am I right in thinking that, for each ordinary page request,
* the raw text is pulled out of the database
* the text is parsed and reformatted
* links are looked up to see whether their target pages exist, and are treated appropriately
* the final HTML page is generated, with page decorations added as per the theme
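If that guess is right, the request path would look something like the sketch below. (All names here are hypothetical, purely to mirror the four steps above -- the real code will differ, and is in PHP, not Python.)

```python
# Hypothetical sketch of the guessed page-request pipeline.
import re

# Stand-in for the article database.
PAGES = {"Main Page": "Welcome to [[Wikipedia]]!", "Wikipedia": "An encyclopedia."}

def render(title):
    text = PAGES[title]                      # 1. raw text pulled out of the database
    # 2./3. parse the [[link]] markup and check whether each target page exists
    def link(m):
        target = m.group(1)
        cls = "existing" if target in PAGES else "new"
        return f'<a class="{cls}" href="/wiki/{target}">{target}</a>'
    body = re.sub(r"\[\[(.*?)\]\]", link, text)
    # 4. final page generation to HTML, with decorations
    return f"<html><body><h1>{title}</h1><p>{body}</p></body></html>"
```

Even in this toy form, steps 2 and 3 (parsing plus one existence lookup per link) are the part done per request, which is why they look like the natural target for caching.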
My general impressions of the activity rate are:

* about 100 pages per day are created or deleted
* roughly one edit every 30 seconds
* roughly one page hit every second
Packet loss seems negligible, so you don't seem to be running out of bandwidth.
Although I guesstimate the hit rate at around one per second, pages seem to take around 5 seconds to serve, suggesting that the system is probably running at a load average of 5 or so.
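As a sanity check on that figure: by Little's law, the number of requests in flight equals the arrival rate times the time each spends in the system, which matches the guessed load average (both input numbers are my estimates, not measurements):

```python
# Little's law: concurrent requests = arrival rate * time in system.
arrival_rate = 1.0   # guessed page hits per second
service_time = 5.0   # observed seconds to serve a page
in_flight = arrival_rate * service_time
print(in_flight)     # about 5 requests in flight, consistent with a load average near 5
```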
My best guess is that the parsing and lookups on regular pages are currently the main load, not editing or exotic database queries -- is this right?
Jimbo has mentioned that the machine has a lot of RAM, so disk I/O is unlikely to be the bottleneck: it's more likely to be CPU and inter-process locking problems.
If so, I think careful page content caching could greatly improve performance, by reducing the number of page parsings, renderings and lookups across the board, at the cost of a slight increase in the cost of page deletion and creation. However, by freeing up resources, performance should improve across the board on all operations.
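A minimal sketch of the kind of caching I have in mind -- pay the parse/render cost once, serve from the cache afterwards, and invalidate on the (much rarer) edits, creations and deletions. The class and names are hypothetical, not taken from the codebase:

```python
# Hypothetical render cache: the expensive parse+render runs once per page
# version; edits/deletions pay a small invalidation cost instead.
class RenderCache:
    def __init__(self, render_fn):
        self.render_fn = render_fn   # the expensive parse + link-lookup + render step
        self.cache = {}

    def get(self, title, wikitext):
        if title not in self.cache:
            self.cache[title] = self.render_fn(wikitext)
        return self.cache[title]

    def invalidate(self, title):
        # called on edit or deletion -- and on creation/deletion of pages
        # this one links to, since that changes link colouring
        self.cache.pop(title, None)
```

The invalidation comment points at the real cost: creating or deleting a page must also invalidate every cached page that links to it, which is the "slight increase in the cost of page deletion and creation" mentioned above.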
If I'm right, I think suitably intelligent caching could be applied not only to ordinary pages, but also to some special pages, without any major redesign or excessive complexity.
Before I start to look at things in more detail, could anyone confirm whether I am even vaguely making sense?
-- Neil
On mer, 2002-04-10 at 05:10, Neil Harris wrote:
> My best guess is that the parsing and lookups on regular pages are currently the main load, not editing or exotic database queries -- is this right?
Not a clue. Initially, the database certainly was the main load, but I haven't heard any newer figures. Jimbo?
> Jimbo has mentioned that the machine has a lot of RAM, so disk I/O is unlikely to be the bottleneck: it's more likely to be CPU and inter-process locking problems.
>
> If so, I think careful page content caching could greatly improve performance, by reducing the number of page parsings, renderings and lookups across the board, at the cost of a slight increase in the cost of page deletion and creation. However, by freeing up resources, performance should improve across the board on all operations.
We used to cache rendered articles, but Jimbo disabled this feature some time ago, claiming he was unable to find a performance advantage. (See mailing list archives circa February 13.)
Personally, I've always found that idea suspicious; caching is definitely faster on my test machine, and should be a particularly big help with, say, long pages full of HTML tables! But then, my test machine has a much, much lower load to deal with than the real Wikipedia. :) Nonetheless, if caching really isn't helping, then something isn't being done right. It should be found, fixed, and re-enabled.
(There were also side issues with the caching -- the meta keyword tags and the interlanguage links didn't get filled out when viewing a cached page. But again, these should be fixed, and aren't a reason for disabling caching altogether.)
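One plausible fix for those side issues, without dropping the cache: store only the rendered article body, and build the per-request parts (meta keyword tags, interlanguage links) fresh at serve time. A hypothetical sketch, not the actual code:

```python
# Hypothetical: cache the expensive body render only; generate the <head>
# (meta keywords, interlanguage links) on every request so it is never stale.
BODY_CACHE = {}

def render_body(wikitext):
    return f"<p>{wikitext}</p>"              # stands in for the real renderer

def serve(title, wikitext, keywords, interlang_urls):
    body = BODY_CACHE.setdefault(title, render_body(wikitext))
    head = (f'<meta name="keywords" content="{",".join(keywords)}">'
            + "".join(f'<link rel="alternate" href="{u}">' for u in interlang_urls))
    return f"<html><head>{head}</head><body>{body}</body></html>"
```

The body render is still cached, but the head is cheap to rebuild, so the cached page no longer comes back with those fields missing.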
> If I'm right, I think suitably intelligent caching could be applied not only to ordinary pages, but also to some special pages, without any major redesign or excessive complexity.
For a brief time we cached the contents of RecentChanges when using the default settings, and I believe the Orphans page was manually refreshed; but these were removed after the queries were made more efficient.
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org