Re: [Wikitech-l] Schema change : category redirects

1 Jul 2008

      On Tue, Jul 1, 2008 at 10:10 AM, Nicolas Dumazet nicdumz@gmail.com wrote:
...
When A redirects to B, all A' category_links entries will share the
same cl_final field. Being quite unexperienced, and very naive, I may
miss something important here... but I think that it makes much more
sense to add that shared value to the category table, instead of
duplicating it times the number of pages included in the category.
I'm saying that instead of adding a cl_final field to the
category_links table, we should perhaps add a cat_final field to the
category table pointing to the final category it belongs to (see [4]
).

Use cases #2 #3 and #4 :

Trivial. Change  cat_final in one category table row.

Use case #5 :

Damn. Tricky. Alter TWO category table rows :p

Use case #6 :

Easy, fetch the corresponding cat_id to fill cl_to.

Use case #7 :

The most expensive operation for this proposal. You have to join
category_links and category (hopefully on cat_id) to retrieve what was
cl_final in the first proposition, and select ... where cat_final =
The important reason for favoring this way of doing it is that use
case 7 is almost certainly the most common one.  It's rare that
categories will need to be moved or redirected, which is basically
what cases 1-5 are.  Case 6 is moderately common but fast for both
scenarios, so we can ignore it.  The question is, do we want to
optimize the rare cases 1-5, or the common case 7?  The answer is the
latter.
Moreover, case 7 is much worse than you think it is, for this
proposal.  For large categories retrieved in sorted order, we must use
an index for sorting.  That requires that the entire result set be
ordered according to the index.  Currently we have an index on (cl_to,
cl_sortkey, cl_from).  Then the query SELECT ... WHERE cl_from = X
ORDER BY cl_sortkey will be able to retrieve in index order: ascending
cl_to, followed by ascending cl_sortkey in the event of a tie,
followed by ascending cl_from in the event of another tie.  (So really
it's like "ORDER BY cl_to, cl_sortkey, cl_from", but cl_to is constant
and ordering by it does nothing, while ordering by cl_from is
incidental and so we don't specify it in the query.)
If we add cl_final, then we'll put an index on (cl_final, cl_sortkey)
or similar (possibly dropping an existing index and/or with other
stuff on the end).  Then the query will be WHERE cl_final = X ORDER BY
cl_sortkey, which will use the index.
On the other hand, with cat_final, we can't use an index for sorting.
The query will be WHERE cl_to=cat_id AND cat_final=X ORDER BY
cl_sortkey.  There are two possible retrieval orders here: retrieve
from categorylinks, then category, or vice versa.  Retrieving from
category first will get us some unknown number of rows, and then we
would have to join to categorylinks using an index on (cl_to).  But
even if that index is actually (cl_to, cl_sortkey), we're going to be
retrieving in cl_to order, and order by cl_sortkey only in the case of
a tie.
In this case, unlike in the previous one, we have multiple cl_to
values, so our ORDER BY cl_sortkey is *not* the same as ORDER BY
cl_to, cl_sortkey.  The range scan on cl_to makes it impossible to use
the rest of the index for sorting, so it would be necessary to
retrieve and sort the entire contents of the category in the
categorylinks table.  This is unacceptable for performance even as an
occasional thing, for very large categories, and it's far from
occasional here: it will occur on every category page view.
Some thought will show that without cross-table indexes, there's no
way to use the index for sorting without denormalizing and copying
cat_final into a cl_final column.  Since it's unacceptable to not use
the index for sorting, this is the only solution available to us.  You
should still have a cat_final (or whatever it will be called, _final
maybe isn't the most descriptive name), but it needs to be copied into
the categorylinks table.  This means that moving very large categories
will probably have to be put on the job queue.  This kind of
performance-mandated restriction is much more acceptable than
restrictions on viewing category pages!
Some reading:
http://dev.mysql.com/doc/refman/4.1/en/order-by-optimization.html
In fact, I'd recommend reading the entire chapter 7 of the MySQL
manual, repeatedly, if you're interested in doing database work.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Wikitech-l] Schema change : category redirects