On Mon, Sep 15, 2008 at 8:54 PM, Darren Hardy dhardy@bren.ucsb.edu wrote:
I think that the problem, at least with this specific query, is with the query structure. page_namespace and rev_user are both already indexed in the page and revision tables, but the query planner doesn't optimize the join "correctly" (maybe coz the page_namespace index is long). Nevertheless, the query is slow because there isn't a user_page table and you have to dig in the huge revision table, so if you explicitly dump all the rev_page data into a derived table that might help your query speed.
Well, to begin with, who says this query is slow?
mysql> SELECT COUNT(*), page_namespace FROM page JOIN revision ON page_id = rev_page WHERE rev_user = 158371 GROUP BY page_namespace ORDER BY page_namespace; +----------+----------------+ | COUNT(*) | page_namespace | +----------+----------------+ | 2893 | 0 | | 478 | 1 | | 578 | 2 | | 628 | 3 | | 1925 | 4 | | 699 | 5 | | 42 | 6 | | 14 | 7 | | 85 | 9 | | 275 | 10 | | 116 | 11 | | 1 | 12 | | 104 | 14 | | 17 | 15 | | 5 | 100 | +----------+----------------+ 15 rows in set (0.17 sec)
That's certainly fast enough for an occasional statistics-gathering query.
I don't see why you think it's not optimizing the join correctly. The execution procedure is to scan through the appropriate range in the user_timestamp index on the revision table, and join each retrieved row to the primary key of the page table, maintaining in-memory counts (in a temporary table) of how many edits there are for each namespace. I don't see what a superior execution strategy would be here.
What relevance could an index on page_namespace have? Are you suggesting that the query should try selecting from the page table first, and joining to the revision table?
For example, try this alternate version:
SELECT COUNT(*), p.page_namespace FROM page p JOIN (SELECT DISTINCT rev_page FROM revision WHERE rev_user = XXXXX) r ON (r.rev_page = p.page_id) GROUP BY p.page_namespace ORDER BY p.page_namespace
It seems to be almost identical in speed, testing on enwiki_p with user id 1385729 (BetacommandBot, 700k edits). It's doing exactly the same thing, as far as I can tell, except that it's storing the result of the revision query in a temporary table first -- which could cause trouble if the table gets too large. Both take about 2.5-3 s on a hot cache.