Hi,
Just wanted to share some of the bits we've been doing this week - we were hopping around, analyzing our performance and application workflow from multiple sides (kind of a "Hello 2008!!!" systems performance review).
It all started with the application object cache - the caching arena was bumped up from 55GB to 160GB - and here more work had to be done to make our parser output cacheable. Any use of magic words (and most templates do use them) would decrease cache TTLs to 1 hour, so the vast increase in caching space didn't help much. Once this was fixed, pages are reparsed just once every few days. Additionally, we moved the revision text caching for external storages to a global pool, instead of maintaining local caches on each of these nodes. That allows us to reuse the memory on old external store boxes for caching the more actively fetched revisions rather than the archived ones.
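To give an idea of the kind of logic involved - a minimal sketch in PHP, with made-up function name and TTL values, not the actual parser cache code:

  <?php
  // Sketch: pick how long parser output may sit in the object cache.
  // Any time-dependent magic word (e.g. {{CURRENTTIME}}) forces the short
  // TTL; the fix was to stop template output from hitting that path
  // unnecessarily, so the enlarged cache actually gets used.
  function parserCacheExpiry( $usedDynamicMagicWords ) {
      if ( $usedDynamicMagicWords ) {
          return 3600;          // stale within an hour
      }
      return 7 * 24 * 3600;     // otherwise keep it for days - edits and
                                // explicit purges invalidate it anyway
  }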
Another major review was done on extension loading - there, by delaying or eliminating expensive initializations, especially for very-rarely-used extensions (relatively :) - we shaved at least 20ms off the site's base loading time (and the average service request time). That also resulted in a huge CPU use reduction. Special thanks here goes to the folks on #mediawiki (Aaron, Nikerabbit, siebrand, Simetrical, and others) who joined this effort of analysis, education and engineering :) There are still more difficult extensions to handle, but I hope they will evolve to be more adaptive performance-wise. This was a long-standing regression caused by the increasing quality of translations - which resulted in a bigger data set to handle on every page load.
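The general pattern is plain deferred initialization - a hedged sketch below (an illustrative class, not any particular extension's code; only $wgHooks is real MediaWiki):

  <?php
  // Sketch: do only cheap registration on every request, and delay the
  // expensive part until the feature is actually used on that request.
  class RarelyUsedFeature {
      private static $data = null;

      public static function getData() {
          if ( self::$data === null ) {
              // expensive: load and unserialize a big message/mapping file
              self::$data = unserialize( file_get_contents( '/tmp/feature-data.ser' ) );
          }
          return self::$data;
      }
  }
  // The setup file that runs on every request should do little more than:
  //   $wgHooks['SomeHook'][] = 'RarelyUsedFeature::onSomeHook';
  // and must never call RarelyUsedFeature::getData() up front.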
A small but noticeable bit was the simplification of the MediaWiki:Pagecategories message on en.wikipedia.org. Logic as simple as "show 'Category:' if there is just one category, and 'Categories:' otherwise" requires invoking the parser, which adds lots and lots of overhead for every page served. Those few milliseconds needed for that absolutely grammatically correct label could be counted in thousands of dollars. :)
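For contrast, the same decision done in plain PHP is essentially free - a tiny sketch (illustrative, not the actual skin code):

  <?php
  // Doing the singular/plural choice here costs nothing, while putting the
  // equivalent {{PLURAL:...}}/{{#if:...}} logic into the message drags the
  // whole parser into rendering a one-word label on every page view.
  function categoriesLabel( $count ) {
      return $count == 1 ? 'Category' : 'Categories';
  }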
There were a few other victims in this unequal fight. TitleBlacklist didn't survive the performance audit - the current architecture of this feature does work in places it never should, and as the initial performance guidelines for it were not followed, it got disabled for a while. Also, some CentralNotice functionality was not optimized for the work it was put to after the fundraiser, so for now that feature is disabled too. Of course, these features will be re-enabled - they just need more work before they can run live.
On another front - in the core software - the database connection flow was reviewed, and a few adjustments were made that reduce master server load quite a bit and cut down on communication with all the database servers (transaction coordination was too verbose before - now it is far more lax).
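The gist of one such adjustment, as a hedged sketch (not MediaWiki's actual LoadBalancer code): only send BEGIN to a server once we actually write to it, and only COMMIT on connections that really have an open transaction, so read-only requests cost zero transaction round trips.

  <?php
  // Sketch of lazy transaction handling around a plain mysqli connection.
  class LazyTrxConnection {
      private $db;
      private $inTrx = false;

      public function __construct( mysqli $db ) {
          $this->db = $db;
      }

      public function write( $sql ) {
          if ( !$this->inTrx ) {
              $this->db->query( 'BEGIN' );   // opened on demand only
              $this->inTrx = true;
          }
          return $this->db->query( $sql );
      }

      public function commitIfNeeded() {
          if ( $this->inTrx ) {
              $this->db->query( 'COMMIT' );
              $this->inTrx = false;
          }
      }
  }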
Here again, some of the application flow is still irrational - and may see quite a bit of refactoring/fixing in the future. Tim pointed out that my knowledge of the xdebug profiler is seriously outdated (my mind was stuck at 2.0.1 features, whereas 2.0.2 introduced quite significant changes that make life easier) ;-) Another shocking revelation was that the CPU microbenchmarks provided by MediaWiki's internal profiler were not accurate at all - the getrusage() call we use returns information rounded to 10ms, and most functions execute far faster than that. It was really amusing that I trusted numbers which only looked rational and reasonable because of the huge profiling scale and eventual statistical magic. This complicates profiling in general a bit, as there's no easy way to determine whether a wait happened because of I/O blocking or context switches.
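You can see the granularity issue with a few lines of plain PHP (not the MediaWiki profiler itself) - getrusage() CPU time moves in roughly 10ms steps, so timing anything sub-millisecond with it returns either zero or a whole scheduler tick:

  <?php
  function cpuTime() {
      $ru = getrusage();
      return $ru['ru_utime.tv_sec'] + $ru['ru_utime.tv_usec'] / 1e6;
  }

  $wallStart = microtime( true );
  $cpuStart  = cpuTime();
  for ( $i = 0; $i < 1000; $i++ ) {
      md5( "iteration $i" );    // work that takes well under 10ms in total
  }
  printf( "wall: %.6fs  cpu: %.6fs\n",
      microtime( true ) - $wallStart, cpuTime() - $cpuStart );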
A few images from the performance analysis work: http://flake.defau.lt/mwpageview.png http://flake.defau.lt/mediawikiprofile.png (somewhere in there you should see why TitleBlacklist died)
This one made me giggle: http://flake.defau.lt/mwmodernart.png
Tim was questioning whether people are using wikitext for scientific calculations, or whether that was just more of the crazy over-templating we are used to seeing. Templates such as Commons' 'picture of the day' produce output like that =) Actually, the new parser code makes far nicer graphs (at least from a performance engineering perspective).
And one of the biggest changes happened on our Squid caching layer - because of how different browsers request data, we generally had different cache sets for IE, Firefox, Opera, Googlebot, KHTML, etc. Now we normalize the 'Accept-Encoding' header specified by browsers, which makes most connections fall into a single class. In theory this may at least double our caching efficiency. In practice, we will see - the change has been live on just one cluster for just a few hours. As a side effect we turned off the 'refresh' button on your browsers. Sorry - please let us know if anything is seriously wrong with that (if you feel offended about your constitutional refreshing rights - use purge instead :)
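The real change lives in the Squid layer, but the gist fits in a few lines - a sketch (not the actual patch) of collapsing the dozens of browser-specific Accept-Encoding strings into two canonical classes, so a page cached under 'Vary: Accept-Encoding' needs two copies instead of one per browser family:

  <?php
  function normalizeAcceptEncoding( $header ) {
      // anything that can take gzip gets 'gzip', everything else gets ''
      return preg_match( '/\bgzip\b/i', $header ) ? 'gzip' : '';
  }

  // normalizeAcceptEncoding( 'gzip, deflate' )  => 'gzip'
  // normalizeAcceptEncoding( 'identity' )       => ''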
Additionally, I've heard there has been quite a bit of development on the new parser, as well as networking in Amsterdam ;-)
Quite a few people also noticed the huge flamewar of 'oh noes, a dev enabled a feature despite our lack of consensus'. Now we're sending people to the board for all the minor changes they ask for :-)
Oh, and Mark changed the scale on our 'backend service time' graph, which is used to measure our health and performance - now the upper limit is 0.3s (which used to be our minimum a few years ago) instead of the old 1s: http://www.nedworks.org/~mark/reqstats/svctimestats-weekly.png
So, that's the fun we've seen this week in site operations :)
Cheers, Domas
P.S. I'll spend next week in Disneyworld instead ;-)