On Tue, Jul 1, 2008 at 10:10 AM, Nicolas Dumazet nicdumz@gmail.com wrote:
When A redirects to B, all A' category_links entries will share the same cl_final field. Being quite unexperienced, and very naive, I may miss something important here... but I think that it makes much more sense to add that shared value to the category table, instead of duplicating it times the number of pages included in the category.
I'm saying that instead of adding a cl_final field to the category_links table, we should perhaps add a cat_final field to the category table pointing to the final category it belongs to (see [4] ).
- Use cases #2 #3 and #4 :
Trivial. Change cat_final in one category table row.
- Use case #5 :
Damn. Tricky. Alter TWO category table rows :p
- Use case #6 :
Easy, fetch the corresponding cat_id to fill cl_to.
- Use case #7 :
The most expensive operation for this proposal. You have to join category_links and category (hopefully on cat_id) to retrieve what was cl_final in the first proposition, and select ... where cat_final =
The important reason for favoring this way of doing it is that use case 7 is almost certainly the most common one. It's rare that categories will need to be moved or redirected, which is basically what cases 1-5 are. Case 6 is moderately common but fast for both scenarios, so we can ignore it. The question is, do we want to optimize the rare cases 1-5, or the common case 7? The answer is the latter.
Moreover, case 7 is much worse than you think it is, for this proposal. For large categories retrieved in sorted order, we must use an index for sorting. That requires that the entire result set be ordered according to the index. Currently we have an index on (cl_to, cl_sortkey, cl_from). Then the query SELECT ... WHERE cl_from = X ORDER BY cl_sortkey will be able to retrieve in index order: ascending cl_to, followed by ascending cl_sortkey in the event of a tie, followed by ascending cl_from in the event of another tie. (So really it's like "ORDER BY cl_to, cl_sortkey, cl_from", but cl_to is constant and ordering by it does nothing, while ordering by cl_from is incidental and so we don't specify it in the query.)
If we add cl_final, then we'll put an index on (cl_final, cl_sortkey) or similar (possibly dropping an existing index and/or with other stuff on the end). Then the query will be WHERE cl_final = X ORDER BY cl_sortkey, which will use the index.
On the other hand, with cat_final, we can't use an index for sorting. The query will be WHERE cl_to=cat_id AND cat_final=X ORDER BY cl_sortkey. There are two possible retrieval orders here: retrieve from categorylinks, then category, or vice versa. Retrieving from category first will get us some unknown number of rows, and then we would have to join to categorylinks using an index on (cl_to). But even if that index is actually (cl_to, cl_sortkey), we're going to be retrieving in cl_to order, and order by cl_sortkey only in the case of a tie.
In this case, unlike in the previous one, we have multiple cl_to values, so our ORDER BY cl_sortkey is *not* the same as ORDER BY cl_to, cl_sortkey. The range scan on cl_to makes it impossible to use the rest of the index for sorting, so it would be necessary to retrieve and sort the entire contents of the category in the categorylinks table. This is unacceptable for performance even as an occasional thing, for very large categories, and it's far from occasional here: it will occur on every category page view.
Some thought will show that without cross-table indexes, there's no way to use the index for sorting without denormalizing and copying cat_final into a cl_final column. Since it's unacceptable to not use the index for sorting, this is the only solution available to us. You should still have a cat_final (or whatever it will be called, _final maybe isn't the most descriptive name), but it needs to be copied into the categorylinks table. This means that moving very large categories will probably have to be put on the job queue. This kind of performance-mandated restriction is much more acceptable than restrictions on viewing category pages!
Some reading:
http://dev.mysql.com/doc/refman/4.1/en/order-by-optimization.html
In fact, I'd recommend reading the entire chapter 7 of the MySQL manual, repeatedly, if you're interested in doing database work.