Simetrical schreef:
Moreover, case 7 is much worse than you think it is, for this proposal. For large categories retrieved in sorted order, we must use an index for sorting. That requires that the entire result set be ordered according to the index. Currently we have an index on (cl_to, cl_sortkey, cl_from). Then the query SELECT ... WHERE cl_from = X ORDER BY cl_sortkey will be able to retrieve in index order: ascending cl_to, followed by ascending cl_sortkey in the event of a tie, followed by ascending cl_from in the event of another tie. (So really it's like "ORDER BY cl_to, cl_sortkey, cl_from", but cl_to is constant and ordering by it does nothing, while ordering by cl_from is incidental and so we don't specify it in the query.)
If we add cl_final, then we'll put an index on (cl_final, cl_sortkey) or similar (possibly dropping an existing index and/or with other stuff on the end). Then the query will be WHERE cl_final = X ORDER BY cl_sortkey, which will use the index.
On the other hand, with cat_final, we can't use an index for sorting. The query will be WHERE cl_to=cat_id AND cat_final=X ORDER BY cl_sortkey. There are two possible retrieval orders here: retrieve from categorylinks, then category, or vice versa. Retrieving from category first will get us some unknown number of rows, and then we would have to join to categorylinks using an index on (cl_to). But even if that index is actually (cl_to, cl_sortkey), we're going to be retrieving in cl_to order, and order by cl_sortkey only in the case of a tie.
In this case, unlike in the previous one, we have multiple cl_to values, so our ORDER BY cl_sortkey is *not* the same as ORDER BY cl_to, cl_sortkey. The range scan on cl_to makes it impossible to use the rest of the index for sorting, so it would be necessary to retrieve and sort the entire contents of the category in the categorylinks table. This is unacceptable for performance even as an occasional thing, for very large categories, and it's far from occasional here: it will occur on every category page view.
Some thought will show that without cross-table indexes, there's no way to use the index for sorting without denormalizing and copying cat_final into a cl_final column. Since it's unacceptable to not use the index for sorting, this is the only solution available to us. You should still have a cat_final (or whatever it will be called, _final maybe isn't the most descriptive name), but it needs to be copied into the categorylinks table. This means that moving very large categories will probably have to be put on the job queue. This kind of performance-mandated restriction is much more acceptable than restrictions on viewing category pages!
Argh, of course, sorting! Using cl_final is obviously the only way to avoid lethal filesorts.
A use case NicDumz seems to have forgotten about:
8) Listing categories a certain page is in
Of course it isn't affected by the schema change, regardless of whether option #1 or #2 is chosen: you'll just do SELECT cl_to FROM categorylinks WHERE cl_from=123 which will return a list of 'original' categories the page is in (i.e. some of those might be redirects), but we probably won't want to resolve redirects in this case anyway. I thought I'd mention it for completeness.
And yeah, we will need to put large updates to categorylinks in the job queue, which probably means new code in the Job class.
Roan Kattouw (Catrope)