There are two very good IRC channels on irc.freenode.net:
#php and #mysql
There are lots of knowledgeable people there who are eager to help open source projects; many have already heard of Wikipedia. I learned a lot from just being there for half an hour or so.
Here are some things we need to look into:
1) Composite indexes and index use in general. I do not know which indexes are currently used on the live Wikipedia. However, after the *default* database generation script, there is just a single composite index, namely in the table watchlist. All other indexes are on a single field. If I understand correctly, this means that when we do a query with multiple comparisons, e.g. WHERE foo=bar AND baz=bat, only one index is used. At least, that is what this article claims:
http://www.linux-mag.com/cgi-bin/printer.pl?issue=2001-06&article=mysql
There are also a couple of tables with no indexes (including ARCHIVE, which may cause Special:Undelete to create high server load) and some unexplainable ones (site_stats has an index, but only one row). We really need to clean up our indexes. I can't help much with this without server access because I don't know if the table structures have been altered.
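To make the point concrete, here is a rough sketch (the table and column names below are made up for illustration, not our actual schema):

  CREATE TABLE example (foo INT, baz INT, INDEX (foo), INDEX (baz));
  EXPLAIN SELECT * FROM example WHERE foo = 1 AND baz = 2;
  -- MySQL picks at most one of the two single-column indexes and has to
  -- check the other condition row by row.

  ALTER TABLE example ADD INDEX foo_baz (foo, baz);
  EXPLAIN SELECT * FROM example WHERE foo = 1 AND baz = 2;
  -- the composite index can satisfy both comparisons at once.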
2) The HANDLER command. HANDLER, which works with both InnoDB and MyISAM, provides low-level access to tables *without locking*. You get reads, but you don't get consistency, which may cause problems when you're dealing with critical, heavily modified data. But it seems perfect for our archive stuff. The HANDLER syntax is explained here: http://www.mysql.com/doc/en/HANDLER.html
We definitely should look into this.
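For reference, the syntax looks roughly like this (the table and index names below are only illustrative and would have to match whatever keys the table actually has; as far as I can tell HANDLER needs MySQL 4.0):

  HANDLER old OPEN;
  HANDLER old READ name_title = (0, 'Foo') LIMIT 5;  -- read rows via a named index
  HANDLER old READ NEXT;                             -- continue from the last position
  HANDLER old CLOSE;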
3) Upgrading to MySQL4. MySQL4 supports query caching (also subqueries, but I haven't looked into that), which means that heavily queried pages will load faster. When someone mentioned query caching, a lot of other people made comments like "query caching rocks", so I assume it provides quite some performance benefits in practice.
4) Tuning my.cnf. According to one IRC resident, upping table_cache can greatly increase performance with lots of UPDATEs. If we create new indexes, we may also need to raise index_cache (there's a formula to calculate your index cache efficiency, currently we are at 99.75%, which is pretty good).
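For example, something along these lines in my.cnf (the numbers are placeholders and would have to be tuned against the SHOW STATUS output):

  [mysqld]
  table_cache     = 512     # open table handles kept around between queries
  key_buffer_size = 256M    # MySQL's name for the index ("key") cache
  # index cache efficiency = 1 - Key_reads / Key_read_requests (from SHOW STATUS)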
5) Caching our article index. I have mentioned this idea before, I think: We could just keep our entire article index in memory *persistently* (currently the linkcache is filled for each request), which should make link lookup almost instant. There are several ways to accomplish this:
- We could ask one of our resident C programmers to help. There's a documented interface in PHP's extension API for persistent resources, which is used by functions such as mysql_pconnect. This page describes how it works: http://www.php-center.de/en-html-manual/phpdevel.html
- We could use session variables for the purpose. There is supposed to be an option to keep session stuff in shared memory. This might help with other caching problems as well.
- We could just put a database on a ramdisk for this purpose.
The order of this list reflects the priority with which I think these different questions need to be addressed. Getting our indexes to work properly should IMHO be of greatest importance. Even if EXPLAIN shows the use of an index, we may still end up scanning large numbers of rows because we do not use composite indexes.
Regards,
Erik
On Fri, 2003-01-31 at 17:15, Erik Moeller wrote:
- Composite indexes and index use in general.... If I
understand correctly, this means that when we do a query with multiple comparisons, e.g. WHERE foo=bar AND baz=bat, only one index is used.
Might want to check into that, yes...
There are also a couple of tables with no indexes (including ARCHIVE, which may cause Special:Undelete to create high server load) and some unexplainable ones (site_stats has an index, but only one row).
Archive is very rarely used. But, it probably should have an index stuck on the namespace & title fields at least.
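Something like this, presumably (assuming the columns are still called ar_namespace and ar_title):

  ALTER TABLE archive ADD INDEX name_title (ar_namespace, ar_title);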
We really need to clean up our indexes. I can't help much with this without server access because I don't know if the table structures have been altered.
They should all be the same...
- The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
provides low-level access to tables *without locking*. You get reads, but you don't get consistency, which may cause problems when you're dealing with critical, heavily modified data.
Now you're scaring me. :)
- Upgrading to MySQL4. MySQL4 supports query caching (also subqueries,
but I haven't looked into that), which means that heavily queried pages will load faster. When someone mentioned query caching, a lot of other people made comments like "query caching rocks", so I assume it provides quite some performance benefits in practice.
I'm planning to do that this weekend. (Weekend nights are relatively low traffic. Relatively. ;)
- Tuning my.cnf. According to one IRC resident, upping table_cache can
greatly increase performance with lots of UPDATEs. If we create new indexes, we may also need to raise index_cache (there's a formula to calculate your index cache efficiency, currently we are at 99.75%, which is pretty good).
Attached are current my.cnf file and SHOW STATUS and SHOW VARIABLES output. More tweaking suggestions are welcome.
- Caching our article index. I have mentioned this idea before, I think:
We could just keep our entire article index in memory *persistently* (currently the linkcache is filled for each request), which should make link lookup almost instant. There are several ways to accomplish this:
I'm still dubious about consistency, but if someone can figure it out, great!
- We could use session variables for the purpose. There is supposed to be
an option to keep session stuff in shared memory. This might help with other caching problems as well.
It's been suggested that cookies/session vars be used for user prefs.
- We could just put a database on a ramdisk for this purpose.
MySQL has a 'HEAP' table type which is in-memory; we wouldn't have to set up a separate ramdisk. It's not persistent, so we'd either have to pre-fill the whole thing or fall back to checking 'cur' on demand.
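A rough sketch of what that could look like (titlecache is a made-up name here, and I'm assuming the cur_namespace/cur_title columns; TYPE=HEAP is the MySQL 3.23/4.0 spelling):

  CREATE TABLE titlecache (
    tc_namespace int NOT NULL default 0,
    tc_title varchar(255) binary NOT NULL default '',
    UNIQUE KEY (tc_namespace, tc_title)
  ) TYPE=HEAP;
  -- pre-fill once at startup; the contents are lost on a server restart
  INSERT INTO titlecache SELECT DISTINCT cur_namespace, cur_title FROM cur;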
-- brion vibber (brion @ pobox.com)
On Sat, Feb 01, 2003 at 02:15:00AM +0100, Erik Moeller wrote:
There are two very good IRC channels on irc.freenode.net:
#php and #mysql
There are lots of knowledgeable people there who are eager to help open source projects; many have already heard of Wikipedia. I learned a lot from just being there for half an hour or so.
Here are some things we need to look into:
- Composite indexes and index use in general. I do not know which indexes
are currently used on the live Wikipedia. However, after the *default* database generation script, there is just a single composite index, namely in the table watchlist. All other indexes are on a single field. If I understand correctly, this means that when we do a query with multiple comparisons, e.g. WHERE foo=bar AND baz=bat, only one index is used.
That is correct, and you could be right about this (I don't know exactly what the current database schema looks like, but last time I was involved this was still ok). But keep in mind that a composite index also adds a lot of overhead for updates, takes up more room in the index cache, and doesn't really add much speed if the extra columns don't have a high selectivity, i.e., filter out many unwanted records.
- The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
provides low-level access to tables *without locking*. You get reads, but you don't get consistency, which may cause problems when you're dealing with critical, heavily modified data. But it seems perfect for our archive stuff.
If there are not many updates then this accomplishes next to nothing, except that you have made the software more complicated.
- Upgrading to MySQL4.
Actually I was once in favour of that but Lee convinced me that as long as the developers of MySQL don't think it is ready for production, we shouldn't either. There is only one thing more important than speed and that is stability.
- Caching our article index. I have mentioned this idea before, I think:
We could just keep our entire article index in memory *persistently* (currently the linkcache is filled for each request), which should make link lookup almost instant.
What makes you think it isn't already? I suspect the biggest part of the involved tables is in the cache anyway, so it is probably pretty fast already. So again, please make sure that this is really worth making the software harder to understand. Keeping the software straightforward and understandable for newcomers who also want to help with programming is a big priority for any open source project.
-- Jan Hidders
.---------------------------------------------------------------------.
| Post-doctoral researcher            e-mail: jan.hidders@ua.ac.be    |
| Dept. Math. & Computer Science      tel: (+32) 3 820 2427           |
| University of Antwerp, room J1.06   fax: (+32) 3 820 2421           |
| (UIA) Universiteitsplein 1, B-2610 Wilrijk - Antwerpen (Belgium)    |
`---------------------------------------------------------------------'
That is correct, and you could be right about this (I don't know exactly what the current database schema looks like, but last time I was involved this was still ok). But keep in mind that a composite index also adds a lot of overhead for updates, takes up more room in the index cache, and doesn't really add much speed if the extra columns don't have a high selectivity, i.e., filter out many unwanted records.
True. However, I suspect with some of our queries, our existing single-field index only has very low selectivity, which makes stuff like "Most wanted" and "History" so slow. Not using any composite indexes seems like a bad idea to me, especially given the fact that some of our queries are run on huge random-sequence tables.
- The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
provides low-level access to tables *without locking*. You get reads, but you don't get consistency, which may cause problems when you're dealing with critical, heavily modified data. But it seems perfect for our archive stuff.
If there are not many updates then this accomplishes next to nothing, except that you have made the software more complicated.
HANDLER works almost exactly the same as SELECT from a syntax POV. We have many updates on the OLD table, and it is still in MyISAM, meaning table-level locking. So unless we upgrade that particular table to InnoDB (which Brion seems reluctant to do because of the long downtime), HANDLER may be worth giving a try.
However, your point regarding complexity is well taken. We could abstract the HANDLER functionality in wfQuery with a "consistent read" parameter that defaults to true.
- Upgrading to MySQL4.
Actually I was once in favour of that but Lee convinced me that as long as the developers of MySQL don't think it is ready for production, we shouldn't either. There is only one thing more important than speed and that is stability.
They are currently labeling it as a "gamma". I think it might be worth giving a try, we can always downgrade if it creates problems.
- Caching our article index. I have mentioned this idea before, I think:
We could just keep our entire article index in memory *persistently* (currently the linkcache is filled for each request), which should make link lookup almost instant.
What makes you think it isn't already?
I have done tests on the link lookup with a local database and it's very slow with link-heavy pages. This is also what our profiling has shown. Again, that may have something to do with our index use and we may want to look into that. Note that I haven't yet benchmarked the new prefill mechanism.
I suspect the biggest part of the involved tables is in the cache anyway,
I don't trust automatic caching, especially since our CUR table is huge and it is only a very small subset of it (the titles) which we need in memory.
Keeping the software straightforward and understandable for newcomers who also want to help with programming is a big priority for any open source project.
The current link caching mechanism is far from easy to understand, it might just be considerably less efficient than the alternatives.
Regards,
Erik
On Sat, Feb 01, 2003 at 01:14:00PM +0100, Erik Moeller wrote:
That is correct, and you could be right about this (I don't know exactly what the current database schema looks like, but last time I was involved this was still ok). But keep in mind that a composite index also adds a lot of overhead for updates, takes up more room in the index cache, and doesn't really add much speed if the extra columns don't have a high selectivity, i.e., filter out many unwanted records.
True. However, I suspect with some of our queries, our existing single-field index only has very low selectivity, which makes stuff like "Most wanted" and "History" so slow.
Adding composite indexes won't help there. What you need is an extra redundant table that contains the number of wanted links.
- The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
provides low-level access to tables *without locking*. You get reads, but you don't get consistency, which may cause problems when you're dealing with critical, heavily modified data. But it seems perfect for our archive stuff.
If there are not many updates then this accomplishes next to nothing, except that you have made the software more complicated.
HANDLER works almost exactly the same as SELECT from a syntax POV.
That's only part of the problem. If anyone wants to mess with this they have to understand what it does and why. That doesn't go away if you have similar syntaxes.
- Upgrading to MySQL4.
Actually I was once in favour of that but Lee convinced me that as long as the developers of MySQL don't think it is ready for production, we shouldn't either. There is only one thing more important than speed and that is stability.
They are currently labeling it as a "gamma". I think it might be worth giving a try, we can always downgrade if it creates problems.
Not if we start using MySQL4 stuff and the problems only start to appear after a while. I agree that the risk is very small, but still.
- Caching our article index. I have mentioned this idea before, I think:
We could just keep our entire article index in memory *persistently* (currently the linkcache is filled for each request), which should make link lookup almost instant.
What makes you think it isn't already?
I have done tests on the link lookup with a local database and it's very slow with link-heavy pages. This is also what our profiling has shown.
Could you be a bit more specific about that?
I suspect the biggest part of the involved tables is in the cache anyway,
I don't trust automatic caching, especially since our CUR table is huge and it is only a very small subset of it (the titles) which we need in memory.
For an existence query it only needs the index, not the table itself. Did you check if we are doing something silly here so that the database has to access the table anyway? What does EXPLAIN say here?
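For instance, something like this would show it (assuming cur has an index covering cur_namespace and cur_title):

  EXPLAIN SELECT cur_namespace, cur_title FROM cur
    WHERE cur_namespace = 0 AND cur_title = 'Foo';
  -- if only indexed columns are selected, "Using index" in the Extra column
  -- means the existence check is answered from the index alone, without
  -- touching the data file.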
-- Jan Hidders
Adding composite indexes won't help there. What you need is an extra redundant table that contains the number of wanted links.
We have a table that contains broken links, which is used in the Most Wanted query. I have no information about the size of these tables in our current database, but I suspect that, again, indexes are not properly used, otherwise the bad performance of this query (which has prompted Magnus to cache the whole page and Brion to create a "miser" mode in which it cannot be loaded) is not explainable.
HANDLER works almost exactly the same as SELECT from a syntax POV.
That's only part of the problem. If anyone wants to mess with this they have to understand what it does and why. That doesn't go away if you have similar syntaxes.
I think your arguments against HANDLER are misguided. We clearly have situations where we do not need consistent reads, and in these situations, HANDLER may be a great help. As I said, we can abstract the functionality to reduce complexity.
They are currently labeling it as a "gamma". I think it might be worth giving a try, we can always downgrade if it creates problems.
Not if we start using MySQL4 stuff and the problems only start to appear after a while. I agree that the risk is very small, but still.
How about using it on meta only? It is reasonably active, but not mission critical.
- Caching our article index. I have mentioned this idea before, I think:
We could just keep our entire article index in memory *persistently* (currently the linkcache is filled for each request), which should make link lookup almost instant.
What makes you think it isn't already?
I have done tests on the link lookup with a local database and it's very slow with link-heavy pages. This is also what our profiling has shown.
Could you be a bit more specific about that?
see http://www.wikipedia.org/pipermail/wikitech-l/2002-November/001166.html http://www.wikipedia.org/pipermail/wikitech-l/2003-January/002238.html
Regards,
Erik
On Sat, Feb 01, 2003 at 04:44:00PM +0100, Erik Moeller wrote:
Adding composite indexes won't help there. What you need is an extra redundant table that contains the number of wanted links.
We have a table that contains broken links, which is used in the Most Wanted query.
I know. I was the one who originally introduced that table. :-)
I have no information about the size of these tables in our current database, but I suspect that, again, indexes are not properly used, otherwise the bad performance of this query (which has prompted Magnus to cache the whole page and Brion to create a "miser" mode in which it cannot be loaded) is not explainable.
I can assure you that the indices are well-defined there and used in the proper way. Some queries are simply hard to compute and can only be improved by precomputing stuff in redundant tables.
HANDLER works almost exactly the same as SELECT from a syntax POV.
That's only part of the problem. If anyone wants to mess with this they have to understand what it does and why. That doesn't go away if you have similar syntaxes.
I think your arguments against HANDLER are misguided. We clearly have situations where we do not need consistent reads, and in these situations, HANDLER may be a great help. As I said, we can abstract the functionality to reduce complexity.
As I said, syntactically hiding the complexity doesn't make it go away.
They are currently labeling it as a "gamma". I think it might be worth giving a try, we can always downgrade if it creates problems.
Not if we start using MySQL4 stuff and the problems only start to appear after a while. I agree that the risk is very small, but still.
How about using it on meta only? It is reasonably active, but not mission critical.
Good idea, but I assumed that Brion was going to test it on the test server anyway.
I have done tests on the link lookup with a local database and it's very slow with link-heavy pages. This is also what our profiling has shown.
Could you be a bit more specific about that?
see http://www.wikipedia.org/pipermail/wikitech-l/2002-November/001166.html http://www.wikipedia.org/pipermail/wikitech-l/2003-January/002238.html
Ok, I already commented on that in my other mail. Btw., did you see Lee's original comments in linkcache.doc in /docs on why the caching is done the way it is?
-- Jan Hidders
On Sat, 2003-02-01 at 07:44, Erik Moeller wrote:
Adding composite indexes won't help there. What you need is an extra redundant table that contains the number of wanted links.
We have a table that contains broken links, which is used in the Most Wanted query. I have no information about the size of these tables in our current database, but I suspect that, again, indexes are not properly used, otherwise the bad performance of this query (which has prompted Magnus to cache the whole page and Brion to create a "miser" mode in which it cannot be loaded) is not explainable.
EXPLAIN SELECT bl_to, COUNT( DISTINCT bl_from ) as nlinks
  FROM brokenlinks
  GROUP BY bl_to
  HAVING nlinks > 1
  ORDER BY nlinks DESC
  LIMIT 0, 50;

+-------------+-------+---------------+-------+---------+------+--------+---------------------------------+
| table       | type  | possible_keys | key   | key_len | ref  | rows   | Extra                           |
+-------------+-------+---------------+-------+---------+------+--------+---------------------------------+
| brokenlinks | index | NULL          | bl_to | 255     | NULL | 319493 | Using temporary; Using filesort |
+-------------+-------+---------------+-------+---------+------+--------+---------------------------------+
The problem is that it has to count up *every* row, then sort based on the counts just to return the top 50. The grouping may be sped up by the index, but it's still going to be a bit of work. (Just tried it, took 37.59 seconds.)
Jan has suggested having a summary table that keeps track of the counts, so we can access them directly. I'm inclined to agree; we could update it on page saves (and deletes and renames and undeletes!) to save the bother of the group/count/sort behemoth. Something like:
CREATE TABLE wantedpages (
  wanted_title varchar(255) binary NOT NULL default '',
  wanted_count int unsigned NOT NULL default 0,
  UNIQUE KEY (wanted_title),
  INDEX (wanted_count)
);
Index on the count, certainly, to speed the sort for the most wanted display. Index on the title should speed picking the right ones on updates, right?
Anyway, this would turn the most wanted query into:
SELECT wanted_title,wanted_count FROM wantedpages WHERE wanted_count > 1 ORDER BY wanted_count DESC LIMIT 0, 50;
which should be quite a bit faster.
I suspect we can be smarter about how we update links and brokenlinks as well; currently we just delete every row pertaining to the page we're dealing with and re-insert it.
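For the wantedpages table above, the per-save maintenance could then be a pair of statements per changed link rather than a full rebuild. A rough sketch, reusing the column names from the CREATE TABLE (values are placeholders):

  -- a broken link to 'Foo' was added by this edit:
  INSERT IGNORE INTO wantedpages (wanted_title, wanted_count) VALUES ('Foo', 0);
  UPDATE wantedpages SET wanted_count = wanted_count + 1 WHERE wanted_title = 'Foo';

  -- a broken link to 'Foo' was removed (or the target page was created):
  UPDATE wantedpages SET wanted_count = wanted_count - 1 WHERE wanted_title = 'Foo';
  DELETE FROM wantedpages WHERE wanted_title = 'Foo' AND wanted_count <= 0;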
-- brion vibber (brion @ pobox.com)
On Sat, Feb 01, 2003 at 09:29:24AM -0800, Brion Vibber wrote:
Jan has suggested having a summary table that keeps track of the counts, so we can access them directly. I'm inclined to agree; we could update it on page saves (and deletes and renames and undeletes!) to save the bother of the group/count/sort behemoth.
Yes. Every time you update the brokenlinks table you also have to update the wantedpages table, so that shouldn't be too hard to find in the code.
Something like:
CREATE TABLE wantedpages (
  wanted_title varchar(255) binary NOT NULL default '',
  wanted_count int unsigned NOT NULL default 0,
  UNIQUE KEY (wanted_title),
  INDEX (wanted_count)
);
Index on the count, certainly, to speed the sort for the most wanted display. Index on the title should speed picking the right ones on updates, right?
Exactly. Btw., shouldn't namespace be in there somewhere, or is that still a part of the title?
I suspect we can be smarter about how we update links and brokenlinks as well; currently we just delete every row pertaining to the page we're dealing with and re-insert it.
Probably, because when a page is saved you have to retrieve the "old" broken links from the DB anyway because there is no other way of knowing them, and you need them to update the wantedpages table. So you might as well compare them to the new list of broken links and remove/add only what has changed, which is usually only a few links anyway.
-- Jan Hidders
On Sat, 2003-02-01 at 10:31, Jan Hidders wrote:
Btw., shouldn't namespace be in there somewhere, or is that still a part of the title?
On the link tables, the title strings include the (text) namespace. That might not be the best way about it, though.
(For that matter, the links table should probably just be cur_id-to-cur_id, so we can grab the titles & namespaces on a join. Or is it useful to store it separately?)
I suspect we can be smarter about how we update links and brokenlinks as well; currently we just delete every row pertaining to the page we're dealing with and re-insert it.
Probably, because when a page is saved you have to retrieve the "old" broken links from the DB anyway because there is no other way of knowing them, and you need them to update the wantedpages table. So you might as well compare them to the new list of broken links and remove/add only what has changed, which is usually only a few links anyway.
Sounds reasonable.
-- brion vibber (brion @ pobox.com)