On Mon, Sep 15, 2008 at 8:54 PM, Darren Hardy <dhardy(a)bren.ucsb.edu> wrote:
I think that the problem, at least with this specific
query, is with
the query structure. page_namespace and rev_user are both already
indexed in the page and revision tables, but the query planner doesn't
optimize the join "correctly" (maybe coz the page_namespace index is
long). Nevertheless, the query is slow because there isn't a user_page
table and you have to dig in the huge revision table, so if you
explicitly dump all the rev_page data into a derived table that might
help your query speed.
Well, to begin with, who says this query is slow?
mysql> SELECT COUNT(*), page_namespace FROM page JOIN revision ON
page_id = rev_page WHERE rev_user = 158371 GROUP BY page_namespace
ORDER BY page_namespace;
+----------+----------------+
| COUNT(*) | page_namespace |
+----------+----------------+
| 2893 | 0 |
| 478 | 1 |
| 578 | 2 |
| 628 | 3 |
| 1925 | 4 |
| 699 | 5 |
| 42 | 6 |
| 14 | 7 |
| 85 | 9 |
| 275 | 10 |
| 116 | 11 |
| 1 | 12 |
| 104 | 14 |
| 17 | 15 |
| 5 | 100 |
+----------+----------------+
15 rows in set (0.17 sec)
That's certainly fast enough for an occasional statistics-gathering query.
I don't see why you think it's not optimizing the join correctly. The
execution procedure is to scan through the appropriate range in the
user_timestamp index on the revision table, and join each retrieved
row to the primary key of the page table, maintaining in-memory counts
(in a temporary table) of how many edits there are for each namespace.
I don't see what a superior execution strategy would be here.
What relevance could an index on page_namespace have? Are you
suggesting that the query should try selecting from the page table
first, and joining to the revision table?
For example, try this alternate version:
SELECT COUNT(*), p.page_namespace
FROM page p
JOIN (SELECT DISTINCT rev_page FROM revision WHERE rev_user = XXXXX) r
ON (r.rev_page = p.page_id)
GROUP BY p.page_namespace
ORDER BY p.page_namespace
It seems to be almost identical in speed, testing on enwiki_p with
user id 1385729 (BetacommandBot, 700k edits). It's doing exactly the
same thing, as far as I can tell, except that it's storing the result
of the revision query in a temporary table first -- which could cause
trouble if the table gets too large. Both take about 2.5-3 s on a hot
cache.