Is it possible to locally create some indexes specific for toolserver? As for now, it will help to seriously speed up most tools. E.g. index (rev_user,rev_page) will speed up all editcounters, which usually uses query like SELECT COUNT(*), page_namespace FROM page JOIN revision ON page_id = rev_page WHERE rev_user = 37880 GROUP BY page_namespace ORDER BY page_namespace;
Any ideas whether it would be possible and how useful would it be? --vvv
On Mon, Sep 15, 2008 at 1:48 PM, Victor Vasiliev vasilvv@gmail.com wrote:
Is it possible to locally create some indexes specific for toolserver? As for now, it will help to seriously speed up most tools. E.g. index (rev_user,rev_page) will speed up all editcounters, which usually uses query like SELECT COUNT(*), page_namespace FROM page JOIN revision ON page_id = rev_page WHERE rev_user = 37880 GROUP BY page_namespace ORDER BY page_namespace;
Why would that index speed up the access noticeably? There's already an index on (rev_user, rev_timestamp) (assuming that the toolserver has stock indexes), and rev_page is in the leaf nodes.
If optimization is needed here, a root should install the microslow patch and do an analysis of what's taking up the most time, and see if those can be indexed or otherwise improved. At any rate, an ALTER TABLE would require considerable downtime, since as far as I know there's only one toolserver database server, so you can't just throw all the load onto the other slaves while you take one out of rotation. (Or are there multiples these days?)
I think that the problem, at least with this specific query, is with the query structure. page_namespace and rev_user are both already indexed in the page and revision tables, but the query planner doesn't optimize the join "correctly" (maybe coz the page_namespace index is long). Nevertheless, the query is slow because there isn't a user_page table and you have to dig in the huge revision table, so if you explicitly dump all the rev_page data into a derived table that might help your query speed.
For example, try this alternate version:
SELECT COUNT(*), p.page_namespace FROM page p JOIN (SELECT DISTINCT rev_page FROM revision WHERE rev_user = XXXXX) r ON (r.rev_page = p.page_id) GROUP BY p.page_namespace ORDER BY p.page_namespace
*************************** 1. row *************************** id: 1 select_type: PRIMARY table: <derived2> type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 14202 Extra: Using temporary; Using filesort *************************** 2. row *************************** id: 1 select_type: PRIMARY table: page type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: r.rev_page rows: 1 Extra: *************************** 3. row *************************** id: 2 select_type: DERIVED table: revision type: ref possible_keys: user_timestamp key: user_timestamp key_len: 4 ref: rows: 79640 Extra: Using where; Using index; Using temporary
-darren
On 15 September 2008, at 2:18 PM, Aryeh Gregor wrote:
On Mon, Sep 15, 2008 at 1:48 PM, Victor Vasiliev vasilvv@gmail.com wrote:
Is it possible to locally create some indexes specific for toolserver? As for now, it will help to seriously speed up most tools. E.g. index (rev_user,rev_page) will speed up all editcounters, which usually uses query like SELECT COUNT(*), page_namespace FROM page JOIN revision ON page_id = rev_page WHERE rev_user = 37880 GROUP BY page_namespace ORDER BY page_namespace;
Why would that index speed up the access noticeably? There's already an index on (rev_user, rev_timestamp) (assuming that the toolserver has stock indexes), and rev_page is in the leaf nodes.
If optimization is needed here, a root should install the microslow patch and do an analysis of what's taking up the most time, and see if those can be indexed or otherwise improved. At any rate, an ALTER TABLE would require considerable downtime, since as far as I know there's only one toolserver database server, so you can't just throw all the load onto the other slaves while you take one out of rotation. (Or are there multiples these days?)
Toolserver-l mailing list Toolserver-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/toolserver-l
explain SELECT COUNT(*), p.page_namespace FROM page p JOIN revision r ON (p.page_id = r.rev_page) WHERE r.rev_user = 37880 GROUP BY p.page_namespace ORDER BY p.page_namespace \G drh08@nightshade:~$ sql enwiki_p < test.sql *************************** 1. row *************************** id: 1 select_type: SIMPLE table: revision type: ref possible_keys: PRIMARY,page_timestamp,user_timestamp key: user_timestamp key_len: 4 ref: const rows: 1 Extra: Using index; Using temporary; Using filesort *************************** 2. row *************************** id: 1 select_type: SIMPLE table: page type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: enwiki.revision.rev_page rows: 1 Extra:
On Mon, Sep 15, 2008 at 8:54 PM, Darren Hardy dhardy@bren.ucsb.edu wrote:
I think that the problem, at least with this specific query, is with the query structure. page_namespace and rev_user are both already indexed in the page and revision tables, but the query planner doesn't optimize the join "correctly" (maybe coz the page_namespace index is long). Nevertheless, the query is slow because there isn't a user_page table and you have to dig in the huge revision table, so if you explicitly dump all the rev_page data into a derived table that might help your query speed.
Well, to begin with, who says this query is slow?
mysql> SELECT COUNT(*), page_namespace FROM page JOIN revision ON page_id = rev_page WHERE rev_user = 158371 GROUP BY page_namespace ORDER BY page_namespace; +----------+----------------+ | COUNT(*) | page_namespace | +----------+----------------+ | 2893 | 0 | | 478 | 1 | | 578 | 2 | | 628 | 3 | | 1925 | 4 | | 699 | 5 | | 42 | 6 | | 14 | 7 | | 85 | 9 | | 275 | 10 | | 116 | 11 | | 1 | 12 | | 104 | 14 | | 17 | 15 | | 5 | 100 | +----------+----------------+ 15 rows in set (0.17 sec)
That's certainly fast enough for an occasional statistics-gathering query.
I don't see why you think it's not optimizing the join correctly. The execution procedure is to scan through the appropriate range in the user_timestamp index on the revision table, and join each retrieved row to the primary key of the page table, maintaining in-memory counts (in a temporary table) of how many edits there are for each namespace. I don't see what a superior execution strategy would be here.
What relevance could an index on page_namespace have? Are you suggesting that the query should try selecting from the page table first, and joining to the revision table?
For example, try this alternate version:
SELECT COUNT(*), p.page_namespace FROM page p JOIN (SELECT DISTINCT rev_page FROM revision WHERE rev_user = XXXXX) r ON (r.rev_page = p.page_id) GROUP BY p.page_namespace ORDER BY p.page_namespace
It seems to be almost identical in speed, testing on enwiki_p with user id 1385729 (BetacommandBot, 700k edits). It's doing exactly the same thing, as far as I can tell, except that it's storing the result of the revision query in a temporary table first -- which could cause trouble if the table gets too large. Both take about 2.5-3 s on a hot cache.
toolserver-l@lists.wikimedia.org