On Sun, Sep 7, 2014 at 8:11 AM, Sean Pringle <springle@wikimedia.org> wrote:

On Sun, Sep 7, 2014 at 5:58 AM, Brad Jorsch (Anomie) <bjorsch@wikimedia.org> wrote:
The database query for that is simple enough:

SELECT /* ApiQueryCategoryMembers::run Anomie */  cl_from,cl_sortkey,cl_type,page_namespace,page_title,cl_timestamp  FROM `page`,`categorylinks`  FORCE INDEX (cl_timestamp) WHERE cl_to = 'Copy_to_Wikimedia_Commons_(bot-assessed)' AND (cl_from=page_id)  ORDER BY cl_timestamp,cl_from LIMIT 501;

And the PHP code doesn't do anything complicated either. Maybe Sean can give us more insight if there's some subtle database thing going on here.

As Nik noted, the query plan walking cl_timestamp is not ideal. Plus, even with the forced index the query requires a filesort since cl_timestamp index is on (cl_to,cl_timestamp) and not (cl_timestamp,cl_from).

We're including a constant cl_to in the query here, so the index on (cl_to,cl_timestamp) is exactly what we want.

As for ORDER BY cl_timestamp, cl_from, that's https://gerrit.wikimedia.org/r/#/c/103589/ taking advantage of InnoDB's clustered indexes where it silently appends the primary key (or in this case the first/only UNIQUE key, cl_from) to all other indexes.

When I EXPLAIN this query against enwiki, there's no filesort on master, db1055, db1051, and db1066.

   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => categorylinks
       [type] => ref
       [possible_keys] => cl_timestamp
       [key] => cl_timestamp
       [key_len] => 257
       [ref] => const
       [rows] => 635858
       [Extra] => Using index condition; Using where
   )
   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => page
       [type] => eq_ref
       [possible_keys] => PRIMARY
       [key] => PRIMARY
       [key_len] => 4
       [ref] => enwiki.categorylinks.cl_from
       [rows] => 1
       [Extra] =>
   )

There is on db1061, db1062, db1065, db1072, and db1073.

   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => categorylinks
       [type] => ref
       [possible_keys] => cl_timestamp
       [key] => cl_timestamp
       [key_len] => 257
       [ref] => const
       [rows] => 706656
       [Extra] => Using index condition; Using where; Using filesort
   )
   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => page
       [type] => eq_ref
       [possible_keys] => PRIMARY
       [key] => PRIMARY
       [key_len] => 4
       [ref] => enwiki.categorylinks.cl_from
       [rows] => 1
       [Extra] =>
   )

My wild guess would be that the latter set of databases were somehow created differently so that InnoDB is clustering using some index other than cl_from.
 
Removing the FORCE INDEX would allow cl_sortkey index to be used, with better selectivity.

On master, db1055, db1051, and db1066, removing the FORCE INDEX still reports from EXPLAIN that it chose the cl_timestamp index. It does cause EXPLAIN to stop saying "Using index condition" though.

   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => categorylinks
       [type] => ref
       [possible_keys] => cl_from,cl_timestamp,cl_sortkey
       [key] => cl_timestamp
       [key_len] => 257
       [ref] => const
       [rows] => 525652
       [Extra] => Using where
   )
   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => page
       [type] => eq_ref
       [possible_keys] => PRIMARY
       [key] => PRIMARY
       [key_len] => 4
       [ref] => enwiki.categorylinks.cl_from
       [rows] => 1
       [Extra] =>
   )

On db1061, db1062, db1065, db1072, and db1073, it does choose cl_sortkey but it still filesorts.

   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => categorylinks
       [type] => ref
       [possible_keys] => cl_from,cl_timestamp,cl_sortkey
       [key] => cl_sortkey
       [key_len] => 257
       [ref] => const
       [rows] => 525652
       [Extra] => Using index condition; Using where; Using filesort
   )
   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => page
       [type] => eq_ref
       [possible_keys] => PRIMARY
       [key] => PRIMARY
       [key_len] => 4
       [ref] => enwiki.categorylinks.cl_from
       [rows] => 1
       [Extra] =>
   )

If I also remove the cl_from from the ORDER BY, then all databases return the same query using the cl_timestamp index. But the API can't do that without reopening https://bugzilla.wikimedia.org/show_bug.cgi?id=24782.


--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation