Hi,
Is there a way to run fast queries (perhaps using indexes) using the "old" table in my replication of the English Wikipedia database (MySQL 4.0.20)? When I've tried queries that include the field "old_title", the simplest queries run for up to an hour on my AMD Athlon XP 2400+ and 512 MB RAM on Windows XP Professional. The more complex ones I'm trying to run take dozens of hours--one query ran for five days, before I finally quit it.
I've configured MySQL using the configuration options that I list at the botton of this e-mail, but things are still unbearably slow. An example of the type of query I've been trying to run would be:
select distinct old_user, old_user_text from old where old_user > 0 and old_title like 'History_of_%';
I'm surprised that this sort of query runs so slowly, considering that when on the live Wikipedia, you click on "User Contributions" and get a list of responses with lightning speed. Is there a faster way to search on all the historic changes to Wikipedia that I'm overlooking? I'd really appreciate any help anyone could give me.
Thanks,
Claudio
[mysqld] basedir=D:/www/mysql #bind-address=192.168.2.22 datadir=D:/www/mysql/data #language=D:/www/mysql/share/your language directory #slow query log#= #tmpdir#= #port=3306 set-variable=key_buffer=128M set-variable= max_allowed_packet=1M set-variable=table_cache=256 set-variable=sort_buffer_size=4M set-variable=read_buffer_size=4M set-variable=myisam_sort_buffer_size=64M set-variable = thread_cache=8 set-variable = thread_concurrency=2 set-variable =tmp_table_size=64M
set-variable=innodb_buffer_pool_size=140M set-variable=innodb_additional_mem_pool_size=10M
[mysqldump] quick set-variable = max_allowed_packet=16M
[isamchk] set-variable = key_buffer=128M set-variable = sort_buffer=128M set-variable = read_buffer=2M set-variable = write_buffer=2M
[myisamchk] set-variable = key_buffer=128M set-variable = sort_buffer=128M set-variable = read_buffer=2M set-variable = write_buffer=2M
______________________________________________________________________ Post your free ad now! http://personals.yahoo.ca
Claudio V wrote:
Is there a way to run fast queries (perhaps using indexes) using the "old" table in my replication of the English Wikipedia database (MySQL 4.0.20)?
You'll either have to make sure your query can make use of the existing indexes, or create the necessary indexes yourself.
select distinct old_user, old_user_text from old where old_user > 0 and old_title like 'History_of_%';
Use explain:
+-------+-------+--------------------------+-----------+---------+------+------+-----------------------------+ | table | type | possible_keys | key | key_len | ref | rows | Extra | +-------+-------+--------------------------+-----------+---------+------+------+-----------------------------+ | old | range | old_title,user_timestamp | old_title | 255 | NULL | 1 | where used; Using temporary | +-------+-------+--------------------------+-----------+---------+------+------+-----------------------------+
Your query is slow because it uses a temporary table.
I'm surprised that this sort of query runs so slowly, considering that when on the live Wikipedia, you click on "User Contributions" and get a list of responses with lightning speed.
mysql> explain SELECT -> old_namespace,old_title,old_timestamp,old_comment, -> old_minor_edit,old_user_text FROM old -> WHERE old_user_text='Timwi' ORDER BY inverse_timestamp -> LIMIT 100; +-------+------+--------------------+--------------------+---------+-------+------+------------+ | table | type | possible_keys | key | key_len | ref | rows | Extra | +-------+------+--------------------+--------------------+---------+-------+------+------------+ | old | ref | usertext_timestamp | usertext_timestamp | 255 | const | 199 | where used | +-------+------+--------------------+--------------------+---------+-------+------+------------+
See? No temporary table.
A major difference that I can see between your query and this one is that yours uses DISTINCT, but you'll have to experiment a bit to see if that's the only reason.
Also, type=ref with ref=const is faster than type=range, but I assume with what your query is trying to achieve you can't avoid type=range.
Timwi
Hi Timwi,
Use explain:
+-------+-------+--------------------------+-----------+---------+------+------+-----------------------------+
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------+-------+--------------------------+-----------+---------+------+------+-----------------------------+
| old | range | old_title,user_timestamp | old_title | 255 | NULL | 1 | where used; Using temporary |
+-------+-------+--------------------------+-----------+---------+------+------+-----------------------------+
Thanks for your quick reply. How did you get an index for "old_title"? My downloaded version doesn't have this. When I tried to create one a couple weeks ago, after around 24 hours it filled up the remaining 12 GB of my hard drive and crashed, so I wasn't able to create it. Here are the indexes currently available for the table "old":
Key_name Column_name old_id old_id old_timestamp old_timestamp old_namespace old_namespace old_namespace old_title name_title_timestamp old_namespace name_title_timestamp old_title name_title_timestamp inverse_timestamp user_timestamp old_user user_timestamp inverse_timestamp usertext_timestamp old_user_text usertext_timestamp inverse_timestamp
I think if I had "old_title" indexed, things would be a lot easier. Any tips?
Thanks.
Claudio
______________________________________________________________________ Post your free ad now! http://personals.yahoo.ca
Claudio V wrote:
I think if I had "old_title" indexed, things would be a lot easier. Any tips?
You should virtually always use namespace and title together as a pair; they're indexed that way. If you only care about the article namespace, put in an 'old_namespace=0' clause.
-- brion vibber (brion @ pobox.com)
Claudio V wrote:
Thanks for your quick reply. How did you get an index for "old_title"?
[[Special:Contributions]] doesn't use one.
However, as Brion mentioned, and as evidenced by your own output as well, there is a mutli-column index on both old_title and old_namespace. You should add "and old_namespace=0" to your query. (But you should probably also remove the "DISTINCT" keyword and instead do the filtering-out of duplicates in your favourite programming language.)
Timwi
wikitech-l@lists.wikimedia.org