Re: [Wikitech-l] Schema change : category redirects

1 Jul 2008


Simetrical wrote:
...
Moreover, case 7 is much worse than you think it is, for this
proposal.  For large categories retrieved in sorted order, we must use
an index for sorting.  That requires that the entire result set be
ordered according to the index.  Currently we have an index on (cl_to,
cl_sortkey, cl_from).  Then the query SELECT ... WHERE cl_to = X
ORDER BY cl_sortkey will be able to retrieve in index order: ascending
cl_to, followed by ascending cl_sortkey in the event of a tie,
followed by ascending cl_from in the event of another tie.  (So really
it's like "ORDER BY cl_to, cl_sortkey, cl_from", but cl_to is constant
and ordering by it does nothing, while ordering by cl_from is
incidental and so we don't specify it in the query.)
If we add cl_final, then we'll put an index on (cl_final, cl_sortkey)
or similar (possibly dropping an existing index and/or with other
stuff on the end).  Then the query will be WHERE cl_final = X ORDER BY
cl_sortkey, which will use the index.
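
To make the cl_final case concrete, here is a minimal sketch using SQLite from Python as a stand-in for MySQL. The toy schema, the integer category ids, the index name cl_final_sortkey and the plan() helper are all assumptions made for illustration, and SQLite's planner is not MySQL's, so treat the plan output as indicative only.

```python
import sqlite3

# Toy schema (assumption): integer ids stand in for category titles, and
# cl_final is the proposed redirect-resolved copy of the category.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categorylinks (
        cl_from    INTEGER NOT NULL,  -- page id
        cl_to      INTEGER NOT NULL,  -- category as written on the page
        cl_final   INTEGER NOT NULL,  -- category after resolving redirects
        cl_sortkey TEXT    NOT NULL
    );
    CREATE INDEX cl_final_sortkey
        ON categorylinks (cl_final, cl_sortkey, cl_from);
""")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?, ?)",
    [
        (1, 10, 10, "Zebra"),
        (2, 11, 10, "Apple"),   # filed under redirect 11, which resolves to 10
        (3, 10, 10, "Mango"),
        (4, 20, 20, "Banana"),  # a different category
    ],
)

QUERY = ("SELECT cl_from, cl_sortkey FROM categorylinks "
         "WHERE cl_final = ? ORDER BY cl_sortkey")

def plan(sql, params):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

rows = conn.execute(QUERY, (10,)).fetchall()
query_plan = plan(QUERY, (10,))
print(rows)        # [(2, 'Apple'), (3, 'Mango'), (1, 'Zebra')]
print(query_plan)  # no separate sort step is expected here
```

Because cl_final is held constant by the WHERE clause, index order within that prefix is already cl_sortkey order, so members filed under the redirect and under the target come back interleaved correctly without a sort pass.
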
On the other hand, with cat_final, we can't use an index for sorting.
The query will be WHERE cl_to=cat_id AND cat_final=X ORDER BY
cl_sortkey.  There are two possible retrieval orders here: retrieve
from categorylinks, then category, or vice versa.  Retrieving from
category first will get us some unknown number of rows, and then we
would have to join to categorylinks using an index on (cl_to).  But
even if that index is actually (cl_to, cl_sortkey), we're going to be
retrieving in cl_to order, and order by cl_sortkey only in the case of
a tie.
In this case, unlike in the previous one, we have multiple cl_to
values, so our ORDER BY cl_sortkey is *not* the same as ORDER BY
cl_to, cl_sortkey.  The range scan on cl_to makes it impossible to use
the rest of the index for sorting, so it would be necessary to
retrieve and sort the entire contents of the category in the
categorylinks table.  This is unacceptable for performance even as an
occasional thing, for very large categories, and it's far from
occasional here: it will occur on every category page view.
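
The mismatch between index order and the requested order can be shown on toy data. This is a sketch under assumptions: SQLite stands in for MySQL, category ids are integers, cat_final holds the redirect target (or the category's own id), and the index names are made up. SQLite reports the extra sort as a temporary B-tree, which plays the role of MySQL's filesort here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        cat_id    INTEGER PRIMARY KEY,
        cat_final INTEGER NOT NULL   -- redirect target, or cat_id itself
    );
    CREATE INDEX cat_final_idx ON category (cat_final);
    CREATE TABLE categorylinks (
        cl_from    INTEGER NOT NULL,
        cl_to      INTEGER NOT NULL,
        cl_sortkey TEXT    NOT NULL
    );
    CREATE INDEX cl_sortkey_idx ON categorylinks (cl_to, cl_sortkey, cl_from);
""")
# Category 11 redirects to 10; category 20 is unrelated.
conn.executemany("INSERT INTO category VALUES (?, ?)",
                 [(10, 10), (11, 10), (20, 20)])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)",
                 [(1, 10, "Zebra"), (2, 11, "Apple"),
                  (3, 10, "Mango"), (4, 20, "Banana")])

JOIN_QUERY = """
    SELECT cl_from, cl_to, cl_sortkey
    FROM category JOIN categorylinks ON cl_to = cat_id
    WHERE cat_final = ?
    ORDER BY cl_sortkey
"""

# What a scan of the (cl_to, cl_sortkey, cl_from) index yields for the two
# matching cl_to values: sorted by cl_to first, cl_sortkey only within a tie.
index_order = conn.execute(
    "SELECT cl_from, cl_to, cl_sortkey FROM categorylinks "
    "WHERE cl_to IN (10, 11) ORDER BY cl_to, cl_sortkey, cl_from").fetchall()

# What the category page actually needs.
wanted_order = conn.execute(JOIN_QUERY, (10,)).fetchall()

join_plan = [row[3] for row in
             conn.execute("EXPLAIN QUERY PLAN " + JOIN_QUERY, (10,))]

print(index_order)   # Mango, Zebra (cl_to 10), then Apple (cl_to 11)
print(wanted_order)  # Apple, Mango, Zebra
print(join_plan)
```

With more than one cl_to value in play, the two orders differ, so the database has to collect every matching row and sort it before returning the first page of results.
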
Some thought will show that without cross-table indexes, there's no
way to use the index for sorting without denormalizing and copying
cat_final into a cl_final column.  Since it's unacceptable to not use
the index for sorting, this is the only solution available to us.  You
should still have a cat_final (or whatever it will be called, _final
maybe isn't the most descriptive name), but it needs to be copied into
the categorylinks table.  This means that moving very large categories
will probably have to be put on the job queue.  This kind of
performance-mandated restriction is much more acceptable than
restrictions on viewing category pages!
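
As a rough illustration of the denormalized update, the sketch below changes cat_final for one category and then copies the new value into cl_final a few rows at a time, the way a queued job might. Everything here is hypothetical: the redirect_category() function, the BATCH_SIZE value, the integer ids, and the use of SQLite's rowid to pick a batch; it does not reflect MediaWiki's actual job queue code.

```python
import sqlite3

BATCH_SIZE = 2  # hypothetical; tiny so the toy data needs several batches

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        cat_id    INTEGER PRIMARY KEY,
        cat_final INTEGER NOT NULL
    );
    CREATE TABLE categorylinks (
        cl_from    INTEGER NOT NULL,
        cl_to      INTEGER NOT NULL,
        cl_final   INTEGER NOT NULL,
        cl_sortkey TEXT    NOT NULL
    );
""")
conn.executemany("INSERT INTO category VALUES (?, ?)", [(10, 10), (11, 11)])
# Five pages filed under category 11, which is not yet a redirect.
conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?, ?)",
                 [(page, 11, 11, "key%d" % page) for page in range(1, 6)])

def redirect_category(conn, cat_id, new_final):
    """Point cat_id at new_final, then copy that into cl_final in batches.

    Returns the number of non-empty batches that were applied.
    """
    conn.execute("UPDATE category SET cat_final = ? WHERE cat_id = ?",
                 (new_final, cat_id))
    conn.commit()
    batches = 0
    while True:
        cur = conn.execute(
            """UPDATE categorylinks SET cl_final = ?
               WHERE rowid IN (SELECT rowid FROM categorylinks
                               WHERE cl_to = ? AND cl_final <> ?
                               LIMIT ?)""",
            (new_final, cat_id, new_final, BATCH_SIZE))
        conn.commit()  # each batch is its own short transaction
        if cur.rowcount == 0:
            return batches
        batches += 1

batches = redirect_category(conn, 11, 10)
stale = conn.execute(
    "SELECT COUNT(*) FROM categorylinks WHERE cl_final <> 10").fetchone()[0]
print(batches, stale)
```

Until the last batch finishes, some rows still carry the old cl_final, so the category page is briefly incomplete rather than slow, which is the trade-off argued for above.
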

Argh, of course, sorting! Using cl_final is obviously the only way to
avoid lethal filesorts.
A use case NicDumz seems to have forgotten about:
8) Listing categories a certain page is in
Of course it isn't affected by the schema change, regardless of whether 
option #1 or #2 is chosen: you'll just do
SELECT cl_to FROM categorylinks WHERE cl_from=123
which will return a list of 'original' categories the page is in (i.e. 
some of those might be redirects), but we probably won't want to resolve 
redirects in this case anyway. I thought I'd mention it for completeness.
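
For completeness, use case 8 on the same kind of toy data: a sketch assuming integer category ids, a proposed cl_final column, and an index leading with cl_from (the index name is made up). It only shows that the lookup reads cl_to and is untouched by whatever cl_final holds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categorylinks (
        cl_from  INTEGER NOT NULL,  -- page id
        cl_to    INTEGER NOT NULL,  -- category as written on the page
        cl_final INTEGER NOT NULL   -- redirect-resolved category (proposed)
    );
    CREATE INDEX cl_from_idx ON categorylinks (cl_from, cl_to);
""")
conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)",
                 [(123, 11, 10),   # page 123 is filed under redirect 11
                  (123, 20, 20),
                  (456, 10, 10)])

# The lookup from the text, with ORDER BY added only to make output stable.
original = [row[0] for row in conn.execute(
    "SELECT cl_to FROM categorylinks WHERE cl_from = ? ORDER BY cl_to",
    (123,))]
print(original)  # [11, 20]: the redirect 11 is listed as-is, not resolved
```
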
And yeah, we will need to put large updates to categorylinks in the job 
queue, which probably means new code in the Job class.
Roan Kattouw (Catrope)
