On Sun, Sep 7, 2014 at 8:11 AM, Sean Pringle <springle(a)wikimedia.org> wrote:
On Sun, Sep 7, 2014 at 5:58 AM, Brad Jorsch (Anomie) <
bjorsch(a)wikimedia.org> wrote:
The database query for that is simple enough:
SELECT /* ApiQueryCategoryMembers::run Anomie */
cl_from,cl_sortkey,cl_type,page_namespace,page_title,cl_timestamp FROM
`page`,`categorylinks` FORCE INDEX (cl_timestamp) WHERE cl_to =
'Copy_to_Wikimedia_Commons_(bot-assessed)' AND (cl_from=page_id) ORDER BY
cl_timestamp,cl_from LIMIT 501;
And the PHP code doesn't do anything complicated either. Maybe Sean can
give us more insight if there's some subtle database thing going on here.
As Nik noted, the query plan walking cl_timestamp is not ideal. Plus, even
with the forced index the query requires a filesort since cl_timestamp
index is on (cl_to,cl_timestamp) and not (cl_timestamp,cl_from).
We're including a constant cl_to in the query here, so the index on
(cl_to,cl_timestamp) is exactly what we want.
As for ORDER BY cl_timestamp, cl_from, that's
https://gerrit.wikimedia.org/r/#/c/103589/ taking advantage of InnoDB's
clustered indexes where it silently appends the primary key (or in this
case the first/only UNIQUE key, cl_from) to all other indexes.
When I EXPLAIN this query against enwiki, there's no filesort on master,
db1055, db1051, and db1066.
stdClass Object
(
[id] => 1
[select_type] => SIMPLE
[table] => categorylinks
[type] => ref
[possible_keys] => cl_timestamp
[key] => cl_timestamp
[key_len] => 257
[ref] => const
[rows] => 635858
[Extra] => Using index condition; Using where
)
stdClass Object
(
[id] => 1
[select_type] => SIMPLE
[table] => page
[type] => eq_ref
[possible_keys] => PRIMARY
[key] => PRIMARY
[key_len] => 4
[ref] => enwiki.categorylinks.cl_from
[rows] => 1
[Extra] =>
)
There is on db1061, db1062, db1065, db1072, and db1073.
stdClass Object
(
[id] => 1
[select_type] => SIMPLE
[table] => categorylinks
[type] => ref
[possible_keys] => cl_timestamp
[key] => cl_timestamp
[key_len] => 257
[ref] => const
[rows] => 706656
[Extra] => Using index condition; Using where; Using filesort
)
stdClass Object
(
[id] => 1
[select_type] => SIMPLE
[table] => page
[type] => eq_ref
[possible_keys] => PRIMARY
[key] => PRIMARY
[key_len] => 4
[ref] => enwiki.categorylinks.cl_from
[rows] => 1
[Extra] =>
)
My wild guess would be that the latter set of databases were somehow
created differently so that InnoDB is clustering using some index other
than cl_from.
Removing the FORCE INDEX would allow cl_sortkey index to be used, with
better selectivity.
On master, db1055, db1051, and db1066, removing the FORCE INDEX still
reports from EXPLAIN that it chose the cl_timestamp index. It does cause
EXPLAIN to stop saying "Using index condition" though.
stdClass Object
(
[id] => 1
[select_type] => SIMPLE
[table] => categorylinks
[type] => ref
[possible_keys] => cl_from,cl_timestamp,cl_sortkey
[key] => cl_timestamp
[key_len] => 257
[ref] => const
[rows] => 525652
[Extra] => Using where
)
stdClass Object
(
[id] => 1
[select_type] => SIMPLE
[table] => page
[type] => eq_ref
[possible_keys] => PRIMARY
[key] => PRIMARY
[key_len] => 4
[ref] => enwiki.categorylinks.cl_from
[rows] => 1
[Extra] =>
)
On db1061, db1062, db1065, db1072, and db1073, it does choose cl_sortkey
but it still filesorts.
stdClass Object
(
[id] => 1
[select_type] => SIMPLE
[table] => categorylinks
[type] => ref
[possible_keys] => cl_from,cl_timestamp,cl_sortkey
[key] => cl_sortkey
[key_len] => 257
[ref] => const
[rows] => 525652
[Extra] => Using index condition; Using where; Using filesort
)
stdClass Object
(
[id] => 1
[select_type] => SIMPLE
[table] => page
[type] => eq_ref
[possible_keys] => PRIMARY
[key] => PRIMARY
[key_len] => 4
[ref] => enwiki.categorylinks.cl_from
[rows] => 1
[Extra] =>
)
If I also remove the cl_from from the ORDER BY, then all databases return
the same query using the cl_timestamp index. But the API can't do that
without reopening
https://bugzilla.wikimedia.org/show_bug.cgi?id=24782.
--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation