Re: [Wikitech-l] Schema change : category redirects

1 Jul 2008

Simetrical schreef:
...
  Moreover, case 7 is much worse than you think it is,
for this
 proposal.  For large categories retrieved in sorted order, we must use
 an index for sorting.  That requires that the entire result set be
 ordered according to the index.  Currently we have an index on (cl_to,
 cl_sortkey, cl_from).  Then the query SELECT ... WHERE cl_from = X
 ORDER BY cl_sortkey will be able to retrieve in index order: ascending
 cl_to, followed by ascending cl_sortkey in the event of a tie,
 followed by ascending cl_from in the event of another tie.  (So really
 it's like "ORDER BY cl_to, cl_sortkey, cl_from", but cl_to is constant
 and ordering by it does nothing, while ordering by cl_from is
 incidental and so we don't specify it in the query.)

 If we add cl_final, then we'll put an index on (cl_final, cl_sortkey)
 or similar (possibly dropping an existing index and/or with other
 stuff on the end).  Then the query will be WHERE cl_final = X ORDER BY
 cl_sortkey, which will use the index.

 On the other hand, with cat_final, we can't use an index for sorting.
 The query will be WHERE cl_to=cat_id AND cat_final=X ORDER BY
 cl_sortkey.  There are two possible retrieval orders here: retrieve
 from categorylinks, then category, or vice versa.  Retrieving from
 category first will get us some unknown number of rows, and then we
 would have to join to categorylinks using an index on (cl_to).  But
 even if that index is actually (cl_to, cl_sortkey), we're going to be
 retrieving in cl_to order, and order by cl_sortkey only in the case of
 a tie.

 In this case, unlike in the previous one, we have multiple cl_to
 values, so our ORDER BY cl_sortkey is *not* the same as ORDER BY
 cl_to, cl_sortkey.  The range scan on cl_to makes it impossible to use
 the rest of the index for sorting, so it would be necessary to
 retrieve and sort the entire contents of the category in the
 categorylinks table.  This is unacceptable for performance even as an
 occasional thing, for very large categories, and it's far from
 occasional here: it will occur on every category page view.

 Some thought will show that without cross-table indexes, there's no
 way to use the index for sorting without denormalizing and copying
 cat_final into a cl_final column.  Since it's unacceptable to not use
 the index for sorting, this is the only solution available to us.  You
 should still have a cat_final (or whatever it will be called, _final
 maybe isn't the most descriptive name), but it needs to be copied into
 the categorylinks table.  This means that moving very large categories
 will probably have to be put on the job queue.  This kind of
 performance-mandated restriction is much more acceptable than
 restrictions on viewing category pages!
    Argh, of course, sorting! Using cl_final is obviously the only way to 
avoid lethal filesorts.

A use case NicDumz seems to have forgotten about:

8) Listing categories a certain page is in

Of course it isn't affected by the schema change, regardless of whether 
option #1 or #2 is chosen: you'll just do
SELECT cl_to FROM categorylinks WHERE cl_from=123
which will return a list of 'original' categories the page is in (i.e. 
some of those might be redirects), but we probably won't want to resolve 
redirects in this case anyway. I thought I'd mention it for completeness.

And yeah, we will need to put large updates to categorylinks in the job 
queue, which probably means new code in the Job class.

Roan Kattouw (Catrope)

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] Schema change : category redirects