There are two very good IRC channels on irc.freenode.net:
#php and #mysql
There are lots of knowledgeable people there who are eager to help open source projects; many have already heard of Wikipedia. I learned a lot from just being there for half an hour or so.
Here are some things we need to look into:
1) Composite indexes and index use in general. I do not know which indexes are currently used on the live Wikipedia. However, after running the *default* database generation script, there is just a single composite index, namely on the watchlist table. All other indexes are on a single field. If I understand correctly, this means that when we run a query with multiple comparisons, e.g. WHERE foo=bar AND baz=bat, only one index can be used. At least, that is what this article claims:
http://www.linux-mag.com/cgi-bin/printer.pl?issue=2001-06&article=mysql
There are also a couple of tables with no indexes at all (including ARCHIVE, which may cause Special:Undelete to create high server load) and some puzzling ones (site_stats has an index, but only one row). We really need to clean up our indexes. I can't help much with this without server access, because I don't know whether the table structures have been altered.
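To make this concrete, here is roughly what adding a composite index looks like. The table and column names (cur, cur_namespace, cur_title) are my guesses at the schema, not something I have verified against the live database:

```sql
-- With only single-column indexes, MySQL picks at most one
-- index for this query and filters the remaining rows itself:
SELECT cur_id FROM cur
WHERE cur_namespace = 0 AND cur_title = 'Foo';

-- A composite index covers both comparisons at once:
ALTER TABLE cur ADD INDEX ns_title (cur_namespace, cur_title);

-- EXPLAIN should now list ns_title in its "key" column:
EXPLAIN SELECT cur_id FROM cur
WHERE cur_namespace = 0 AND cur_title = 'Foo';
```

Running EXPLAIN before and after is the easiest way to check whether a new index actually gets picked up.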
2) The HANDLER command. HANDLER, which works with both InnoDB and MyISAM, provides low-level access to tables *without locking*. You get reads, but you don't get consistency, which may cause problems when you're dealing with critical, heavily modified data. But it seems perfect for our archive stuff. The HANDLER syntax is explained here: http://www.mysql.com/doc/en/HANDLER.html
We definitely should look into this.
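To give an idea of the syntax, here is a sketch of what reading archived revisions through HANDLER might look like. The table name old and the index name old_title are assumptions on my part:

```sql
-- Open a low-level cursor on the table (no table lock is
-- taken, so reads are fast but not guaranteed consistent):
HANDLER old OPEN;

-- Fetch up to 10 rows matching an index value:
HANDLER old READ old_title = ('Foo') LIMIT 10;

-- Or walk the table sequentially:
HANDLER old READ FIRST;
HANDLER old READ NEXT;

HANDLER old CLOSE;
```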
3) Upgrading to MySQL4. MySQL4 supports query caching (also subqueries, but I haven't looked into those), which means that heavily queried pages will load faster. When someone mentioned query caching, a lot of other people made comments like "query caching rocks", so I assume it provides substantial performance benefits in practice.
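Note that the query cache has to be switched on explicitly. Once we upgrade, something like the following should tell us whether it is enabled and how well it is doing (variable names per the MySQL 4.0 manual):

```sql
-- Is the cache enabled, and how large is it?
SHOW VARIABLES LIKE 'query_cache%';

-- Hit counters; a high Qcache_hits relative to
-- Qcache_inserts means the cache is earning its keep:
SHOW STATUS LIKE 'Qcache%';
```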
4) Tuning my.cnf. According to one IRC resident, upping table_cache can greatly increase performance with lots of UPDATEs. If we create new indexes, we may also need to raise the index cache (there's a formula to calculate index cache efficiency; we are currently at 99.75%, which is pretty good).
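A sketch of the relevant my.cnf entries; I am assuming the "index cache" here refers to the MyISAM key buffer (key_buffer_size). The values are placeholders to show the format, not recommendations for our hardware:

```
[mysqld]
table_cache     = 512   # open table descriptors kept cached
key_buffer_size = 64M   # in-memory cache for index blocks
```

For reference, the efficiency figure above is 1 - Key_reads / Key_read_requests, computed from the counters in SHOW STATUS.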
5) Caching our article index. I have mentioned this idea before, I think: We could just keep our entire article index in memory *persistently* (currently the linkcache is filled for each request), which should make link lookup almost instant. There are several ways to accomplish this:
- We could ask one of our resident C programmers to help. There is a documented interface in PHP for accessing persistent resources, which is used by functions such as mysql_pconnect. This page describes how it works: http://www.php-center.de/en-html-manual/phpdevel.html
- We could use session variables for the purpose. There is supposed to be an option to keep session stuff in shared memory. This might help with other caching problems as well.
- We could just put a database on a ramdisk for this purpose.
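In the same spirit as the ramdisk option, MySQL itself has an in-memory table type (TYPE=HEAP) that might do the job without extra moving parts. This is just a sketch; the table and column names are made up:

```sql
-- HEAP tables live entirely in RAM; their contents are lost
-- on server restart, which is fine for a rebuildable index:
CREATE TABLE linkindex (
  li_namespace INT NOT NULL,
  li_title CHAR(255) NOT NULL,
  PRIMARY KEY (li_namespace, li_title)
) TYPE=HEAP;
```

One caveat: old HEAP tables use hash indexes, so they are great for exact lookups (which is what link resolution needs) but not for range scans.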
The order of this list reflects the priority with which I think these questions need to be addressed. Getting our indexes to work properly should IMHO be of greatest importance. Even if EXPLAIN shows that an index is used, we may still frequently end up with large table scans because we do not use composite indexes.
Regards,
Erik