Hello !
Following the latest thread on this [0] (and the older [1]) about implementation details for my GsoC project, aiming at adding category redirects / moves, I would need more input on the DB schema change that was being discussed, knowing that :
* We expect category moves to be reversible * The only "feature" we are in fact adding is category redirects * I am considering this schema change as a good opportunity to fix bug #13579 [Categorylinks table should use category ID rather than category name in cl_to field] (please comment overthere if you see specific issues for that change)
While thinking about that schema change, I considered the following use cases, considering categories A B and C : 1)Move existing A to empty B. 2)A and B contain pages : Redirect A to B. 3)A redirects to B : make A redirect to C 4)A redirects to B : undo the redirect, make A and B plain categories 5)A redirects to B : invert the redirect, make B redirect to A 6)On page edits, alter the category_links table 7)Listing pages that belong to a specific category
Use case #1 is fairly easy to implement if cl_to points to a cat_id : you only have to rename the cat_title of the corresponding category table row, leaving the cat_id unchanged, and that's in fact the reason for switching cl_to' type
== Original idea : add a cl_final field ==
The original idea, from the previous threads, was to change the category_links table : instead of only having the single cl_to pointing to the cat_id of the included category, add a cl_final integer field, also pointing to a cat_id, but this time the cat_id of the final category (see [2] ) : in other words, when category A redirects to category B, if a page includes category A, its category_links row' cl_to will point to category A' cat_id, and its cl_final will point to category B' cat_id.
* Use cases #2, #3, and #4 : Expensive when A is a big category ! For each category_links row where cl_to = catA_cat_id, update cl_final, to respectively, B' cat_id, C' cat_id, and A' cat_id.
* Use case #5 : Very expensive, for large A and B: you have to update every A and B category_links.
* Use case #6 : When adding a category to a page, you have to fetch its corresponding cat_id for cl_to, (fairly easy), but you also have to fetch the right cl_final. You have to know first, if the category is a redirect. And if I'm right, with that schema, the only ways to tell this actually, are to retrieve the corresponding page_is_redirect in the page table, or to check for an entry in the redirect table. I believe that this forces us to compute a page-redirect join on page_id + a redirect-category join on page_title for each category title (see [3] ). If there's no results for that query, (can be caused if {1} the category page does not exist, {2} the category page exist but is not a redirect), the category is not a redirect, and else it returns us the cat_id for cl_final.
* Use case #7 : Easy. SELECT ... FROM category_links WHERE cl_final = ##
== But what about adding a cat_final field instead ? ==
When A redirects to B, all A' category_links entries will share the same cl_final field. Being quite unexperienced, and very naive, I may miss something important here... but I think that it makes much more sense to add that shared value to the category table, instead of duplicating it times the number of pages included in the category.
I'm saying that instead of adding a cl_final field to the category_links table, we should perhaps add a cat_final field to the category table pointing to the final category it belongs to (see [4] ).
* Use cases #2 #3 and #4 : Trivial. Change cat_final in one category table row.
* Use case #5 : Damn. Tricky. Alter TWO category table rows :p
* Use case #6 : Easy, fetch the corresponding cat_id to fill cl_to.
* Use case #7 : The most expensive operation for this proposal. You have to join category_links and category (hopefully on cat_id) to retrieve what was cl_final in the first proposition, and select ... where cat_final = ###
Evaluating the cost of a query is clearly one of the hardest things to do for that young me, and I fail to estimate the cost of that last query. How expensive is it compared to its use frequency ? Being apparently the most expensive part for this proposed change, the answer of that question will probably state how valid was my second approach ...
I need your help to finalize the schema change needed to implement category redirects. I may also miss a third solution, even better, or have forgiven some important details while considering what has to be done : let me know :)
[0] http://lists.wikimedia.org/pipermail/wikitech-l/2008-June/038495.html [1] http://lists.wikimedia.org/pipermail/wikitech-l/2008-April/thread.html#37218 [2] http://commons.wikimedia.org/wiki/Image:Mediawiki_schema_change_for_category... [3] http://commons.wikimedia.org/wiki/Image:Mediawiki_use_case_category_redirect... [4] http://commons.wikimedia.org/wiki/Image:Mediawiki_schema_change_for_category...
On Tue, Jul 1, 2008 at 10:10 AM, Nicolas Dumazet nicdumz@gmail.com wrote:
When A redirects to B, all A' category_links entries will share the same cl_final field. Being quite unexperienced, and very naive, I may miss something important here... but I think that it makes much more sense to add that shared value to the category table, instead of duplicating it times the number of pages included in the category.
I'm saying that instead of adding a cl_final field to the category_links table, we should perhaps add a cat_final field to the category table pointing to the final category it belongs to (see [4] ).
- Use cases #2 #3 and #4 :
Trivial. Change cat_final in one category table row.
- Use case #5 :
Damn. Tricky. Alter TWO category table rows :p
- Use case #6 :
Easy, fetch the corresponding cat_id to fill cl_to.
- Use case #7 :
The most expensive operation for this proposal. You have to join category_links and category (hopefully on cat_id) to retrieve what was cl_final in the first proposition, and select ... where cat_final =
The important reason for favoring this way of doing it is that use case 7 is almost certainly the most common one. It's rare that categories will need to be moved or redirected, which is basically what cases 1-5 are. Case 6 is moderately common but fast for both scenarios, so we can ignore it. The question is, do we want to optimize the rare cases 1-5, or the common case 7? The answer is the latter.
Moreover, case 7 is much worse than you think it is, for this proposal. For large categories retrieved in sorted order, we must use an index for sorting. That requires that the entire result set be ordered according to the index. Currently we have an index on (cl_to, cl_sortkey, cl_from). Then the query SELECT ... WHERE cl_from = X ORDER BY cl_sortkey will be able to retrieve in index order: ascending cl_to, followed by ascending cl_sortkey in the event of a tie, followed by ascending cl_from in the event of another tie. (So really it's like "ORDER BY cl_to, cl_sortkey, cl_from", but cl_to is constant and ordering by it does nothing, while ordering by cl_from is incidental and so we don't specify it in the query.)
If we add cl_final, then we'll put an index on (cl_final, cl_sortkey) or similar (possibly dropping an existing index and/or with other stuff on the end). Then the query will be WHERE cl_final = X ORDER BY cl_sortkey, which will use the index.
On the other hand, with cat_final, we can't use an index for sorting. The query will be WHERE cl_to=cat_id AND cat_final=X ORDER BY cl_sortkey. There are two possible retrieval orders here: retrieve from categorylinks, then category, or vice versa. Retrieving from category first will get us some unknown number of rows, and then we would have to join to categorylinks using an index on (cl_to). But even if that index is actually (cl_to, cl_sortkey), we're going to be retrieving in cl_to order, and order by cl_sortkey only in the case of a tie.
In this case, unlike in the previous one, we have multiple cl_to values, so our ORDER BY cl_sortkey is *not* the same as ORDER BY cl_to, cl_sortkey. The range scan on cl_to makes it impossible to use the rest of the index for sorting, so it would be necessary to retrieve and sort the entire contents of the category in the categorylinks table. This is unacceptable for performance even as an occasional thing, for very large categories, and it's far from occasional here: it will occur on every category page view.
Some thought will show that without cross-table indexes, there's no way to use the index for sorting without denormalizing and copying cat_final into a cl_final column. Since it's unacceptable to not use the index for sorting, this is the only solution available to us. You should still have a cat_final (or whatever it will be called, _final maybe isn't the most descriptive name), but it needs to be copied into the categorylinks table. This means that moving very large categories will probably have to be put on the job queue. This kind of performance-mandated restriction is much more acceptable than restrictions on viewing category pages!
Some reading:
http://dev.mysql.com/doc/refman/4.1/en/order-by-optimization.html
In fact, I'd recommend reading the entire chapter 7 of the MySQL manual, repeatedly, if you're interested in doing database work.
Simetrical schreef:
Moreover, case 7 is much worse than you think it is, for this proposal. For large categories retrieved in sorted order, we must use an index for sorting. That requires that the entire result set be ordered according to the index. Currently we have an index on (cl_to, cl_sortkey, cl_from). Then the query SELECT ... WHERE cl_from = X ORDER BY cl_sortkey will be able to retrieve in index order: ascending cl_to, followed by ascending cl_sortkey in the event of a tie, followed by ascending cl_from in the event of another tie. (So really it's like "ORDER BY cl_to, cl_sortkey, cl_from", but cl_to is constant and ordering by it does nothing, while ordering by cl_from is incidental and so we don't specify it in the query.)
If we add cl_final, then we'll put an index on (cl_final, cl_sortkey) or similar (possibly dropping an existing index and/or with other stuff on the end). Then the query will be WHERE cl_final = X ORDER BY cl_sortkey, which will use the index.
On the other hand, with cat_final, we can't use an index for sorting. The query will be WHERE cl_to=cat_id AND cat_final=X ORDER BY cl_sortkey. There are two possible retrieval orders here: retrieve from categorylinks, then category, or vice versa. Retrieving from category first will get us some unknown number of rows, and then we would have to join to categorylinks using an index on (cl_to). But even if that index is actually (cl_to, cl_sortkey), we're going to be retrieving in cl_to order, and order by cl_sortkey only in the case of a tie.
In this case, unlike in the previous one, we have multiple cl_to values, so our ORDER BY cl_sortkey is *not* the same as ORDER BY cl_to, cl_sortkey. The range scan on cl_to makes it impossible to use the rest of the index for sorting, so it would be necessary to retrieve and sort the entire contents of the category in the categorylinks table. This is unacceptable for performance even as an occasional thing, for very large categories, and it's far from occasional here: it will occur on every category page view.
Some thought will show that without cross-table indexes, there's no way to use the index for sorting without denormalizing and copying cat_final into a cl_final column. Since it's unacceptable to not use the index for sorting, this is the only solution available to us. You should still have a cat_final (or whatever it will be called, _final maybe isn't the most descriptive name), but it needs to be copied into the categorylinks table. This means that moving very large categories will probably have to be put on the job queue. This kind of performance-mandated restriction is much more acceptable than restrictions on viewing category pages!
Argh, of course, sorting! Using cl_final is obviously the only way to avoid lethal filesorts.
A use case NicDumz seems to have forgotten about:
8) Listing categories a certain page is in
Of course it isn't affected by the schema change, regardless of whether option #1 or #2 is chosen: you'll just do SELECT cl_to FROM categorylinks WHERE cl_from=123 which will return a list of 'original' categories the page is in (i.e. some of those might be redirects), but we probably won't want to resolve redirects in this case anyway. I thought I'd mention it for completeness.
And yeah, we will need to put large updates to categorylinks in the job queue, which probably means new code in the Job class.
Roan Kattouw (Catrope)
Nicolas Dumazet schreef:
== Original idea : add a cl_final field ==
<snip>
- Use case #6 :
When adding a category to a page, you have to fetch its corresponding cat_id for cl_to, (fairly easy), but you also have to fetch the right cl_final. You have to know first, if the category is a redirect. And if I'm right, with that schema, the only ways to tell this actually, are to retrieve the corresponding page_is_redirect in the page table, or to check for an entry in the redirect table. I believe that this forces us to compute a page-redirect join on page_id + a redirect-category join on page_title for each category title (see [3] ). If there's no results for that query, (can be caused if {1} the category page does not exist, {2} the category page exist but is not a redirect), the category is not a redirect, and else it returns us the cat_id for cl_final.
You wouldn't need to go through the page table here if you added a cat_page field to the category table. It's not a big deal, though, because joining on page_namespace=14 AND page_title=cat_title is pretty cheap.
- Use case #7 :
Easy. SELECT ... FROM category_links WHERE cl_final = ##
== But what about adding a cat_final field instead ? ==
When A redirects to B, all A' category_links entries will share the same cl_final field. Being quite unexperienced, and very naive, I may miss something important here... but I think that it makes much more sense to add that shared value to the category table, instead of duplicating it times the number of pages included in the category.
I'm saying that instead of adding a cl_final field to the category_links table, we should perhaps add a cat_final field to the category table pointing to the final category it belongs to (see [4] ).
Sounds like a quite sane idea to me.
- Use cases #2 #3 and #4 :
Trivial. Change cat_final in one category table row.
- Use case #5 :
Damn. Tricky. Alter TWO category table rows :p
- Use case #6 :
Easy, fetch the corresponding cat_id to fill cl_to.
- Use case #7 :
The most expensive operation for this proposal. You have to join category_links and category (hopefully on cat_id)
Can be done on cat_title=cl_to too, doesn't really matter. Joining on cat_id is cleaner of course, but I don't think it'll be any faster.
to retrieve what was cl_final in the first proposition, and select ... where cat_final = ###
I don't see how this is a big deal. The queries would be: SELECT cat_id FROM category WHERE cat_title='Foo' LIMIT 1 SELECT cl_from FROM categorylinks JOIN category ON cl_to=cat_title WHERE cat_final=123 (here 123 is the result from the first query) The necessary indices are in place, and assuming that we'll add a KEY cat_final (cat_final) along with the cat_final field (it would be very stupid not to), this should be a rather cheap query.
Evaluating the cost of a query is clearly one of the hardest things to do for that young me, and I fail to estimate the cost of that last query. How expensive is it compared to its use frequency ? Being apparently the most expensive part for this proposed change, the answer of that question will probably state how valid was my second approach ...
In my opinion, the benefit of having a fixed and small (1 or 2) number of UPDATEs rather than a potentially huge one when changing stuff outweighs the cost of having a slightly more complex query which for lookup which, in my opinion, is still quite cheap. There are quite a few joins like those in core already (most against page, though, but that doesn't really matter), so I don't think it'll cause any problems.
I need your help to finalize the schema change needed to implement category redirects. I may also miss a third solution, even better, or have forgiven some important details while considering what has to be done : let me know :)
I'm not satisfied until a performance expert like Simetrical or Domas has cast their verdict, but option #2 looks like the way to go to me.
Roan Kattouw (Catrope)
On Tue, Jul 1, 2008 at 11:22 AM, Roan Kattouw roan.kattouw@home.nl wrote:
You wouldn't need to go through the page table here if you added a cat_page field to the category table.
Categories are not guaranteed to have associated pages, so we must keep cat_title. In that case cat_page is redundant and denormalized, and should only be added if there's some specific performance benefit to it, which there's not for any application I've heard.
Can be done on cat_title=cl_to too, doesn't really matter. Joining on cat_id is cleaner of course, but I don't think it'll be any faster.
Joining on integers is considerably faster than joining on VARCHARs. In InnoDB, joining on primary keys is considerably faster than joining on anything else. These shouldn't be neglected, in general. However, the speed of the join is not the limiting factor here, the problem is you'll have to filesort the entire category.
Simetrical schreef:
On Tue, Jul 1, 2008 at 11:22 AM, Roan Kattouw roan.kattouw@home.nl wrote:
You wouldn't need to go through the page table here if you added a cat_page field to the category table.
Categories are not guaranteed to have associated pages, so we must keep cat_title. In that case cat_page is redundant and denormalized, and should only be added if there's some specific performance benefit to it, which there's not for any application I've heard.
I wasn't suggesting ditching cat_title. The fact that cat_page is sometimes 0 doesn't really matter, since categories without corresponding pages can't be redirects, so they don't have a redirect target (which is what the query was about). I believe there actually is a performance benefit to introducing cat_page, explained below.
Can be done on cat_title=cl_to too, doesn't really matter. Joining on cat_id is cleaner of course, but I don't think it'll be any faster.
Joining on integers is considerably faster than joining on VARCHARs. In InnoDB, joining on primary keys is considerably faster than joining on anything else. These shouldn't be neglected, in general.
I had the idea joining on integers would be faster, I guess I underestimated how much. I didn't know about the primary key thing, but it makes sense (it's called *primary* key for a reason).
However, the speed of the join is not the limiting factor here, the problem is you'll have to filesort the entire category.
With the schema I was backing, yes. However, schema #1 doesn't eliminate the need to check for a category's redirect target when adding a page to it, and since joining on cat_page=page_id is faster than joining on cat_title=page_title (because of the int vs. varchar and primary vs. non-primary issues), that would constitute a "specific performance benefit" for adding cat_page, wouldn't it?
Roan Kattouw (Catrope)
On Tue, Jul 1, 2008 at 12:02 PM, Roan Kattouw roan.kattouw@home.nl wrote:
I had the idea joining on integers would be faster, I guess I underestimated how much. I didn't know about the primary key thing, but it makes sense (it's called *primary* key for a reason).
The primary key benefits are specific to InnoDB, since it clusters the table data in the primary key (basically the table data is in the leaves of the primary key B-tree, if I understand right). A primary key lookup is therefore one B-tree lookup instead of two. In MyISAM, and in many other DBMSes, the primary key isn't special.
With the schema I was backing, yes. However, schema #1 doesn't eliminate the need to check for a category's redirect target when adding a page to it, and since joining on cat_page=page_id is faster than joining on cat_title=page_title (because of the int vs. varchar and primary vs. non-primary issues), that would constitute a "specific performance benefit" for adding cat_page, wouldn't it?
Not a big enough one. Adding an extra column means every row is that much larger, reducing key buffer efficiency and thereby hurting performance slightly for all queries on the table. And if you're going to join on it you also may need an extra index (depending on join direction), which takes time to maintain on every insert and delete and also competes for the cache. Plus you get the headache of denormalization.
In this case it's almost certainly not worth it. In the case of cl_final, where you're avoiding cripplingly large filesorts, it's definitely worth it, because otherwise the feature is completely untenable.
Thanks, both of you, for your in-depth answers. It helped a lot :)
I have started a branch (category-redirects) to work on this. So far, category redirects are working, and I've created a new Job subclass to handle categorylinks big changes after a category move. Category moves are apparently working, as far as the target category (not category page) does not exist.
I am, however, considering to disallow category moves over an existing category (again, the *category*, not its attached page) : * When the user does this, he probably expects the categories to be merged together. * However as said before, the move has to be reversible: the proper way to do this is to redirect the former category to the latter; reversing this is only deleting a redirect. * But actually, after that redirect, there will be one page for two category objects. If we merged A and B to B, category A will redirect to category B, while the only valid Page will be B. That might not be a big problem, but it differs from the structure I currently use : 1 Category object <-> 1 Page object * It gets worse if you consider a third Category C, that was redirecting to A before the move : The user moves/merges A and B, so he expects C to redirect to the merged categories. However, after the move, we'll only have a redirect chain : C->A->B. And like double redirects for pages, it wont work. I don't see a way to resolve this use case in a reversible manner with my new schema.
Overriding move would be forbidden, and the user would have to create a redirect. Do you see a case where the user would suffer from creating a redirect instead of doing a hard, plain move ?
There is a little problem with that constraint, though : if a category once contained pages, it will still exist. It means that the user can be asked not to move a category to an... empty category. The easy answer here is "just check for category membership, and allow the move when no categorylinks entries point to the target category"; however, how will we deal with CategoryLinksUpdate Jobs that might be pending ? In other words, the target category might appear empty at the query time, when a move / redirect change is undergo. Should I try to address that ? If on moves, I make the job_title the target category title, searching the job table for job_cmd=categoryLinksUpdate AND job_title=target could help, but how does it sound ?
2008/7/1 Simetrical Simetrical+wikilist@gmail.com:
On Tue, Jul 1, 2008 at 12:02 PM, Roan Kattouw roan.kattouw@home.nl wrote:
I had the idea joining on integers would be faster, I guess I underestimated how much. I didn't know about the primary key thing, but it makes sense (it's called *primary* key for a reason).
The primary key benefits are specific to InnoDB, since it clusters the table data in the primary key (basically the table data is in the leaves of the primary key B-tree, if I understand right). A primary key lookup is therefore one B-tree lookup instead of two. In MyISAM, and in many other DBMSes, the primary key isn't special.
With the schema I was backing, yes. However, schema #1 doesn't eliminate the need to check for a category's redirect target when adding a page to it, and since joining on cat_page=page_id is faster than joining on cat_title=page_title (because of the int vs. varchar and primary vs. non-primary issues), that would constitute a "specific performance benefit" for adding cat_page, wouldn't it?
Not a big enough one. Adding an extra column means every row is that much larger, reducing key buffer efficiency and thereby hurting performance slightly for all queries on the table. And if you're going to join on it you also may need an extra index (depending on join direction), which takes time to maintain on every insert and delete and also competes for the cache. Plus you get the headache of denormalization.
In this case it's almost certainly not worth it. In the case of cl_final, where you're avoiding cripplingly large filesorts, it's definitely worth it, because otherwise the feature is completely untenable.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Mon, Jul 7, 2008 at 6:57 AM, Nicolas Dumazet nicdumz@gmail.com wrote:
I am, however, considering to disallow category moves over an existing category (again, the *category*, not its attached page) :
- When the user does this, he probably expects the categories to be
merged together.
- However as said before, the move has to be reversible: the proper
way to do this is to redirect the former category to the latter; reversing this is only deleting a redirect.
- But actually, after that redirect, there will be one page for two
category objects. If we merged A and B to B, category A will redirect to category B, while the only valid Page will be B. That might not be a big problem, but it differs from the structure I currently use : 1 Category object <-> 1 Page object
So you're saying that basically, just as you can't move a page over an existing page, but have to manually create the redirect, so too here you can't move a category over an existing category, but have to manually create a redirect. If I got that right, it seems reasonable to me.
There is a little problem with that constraint, though : if a category once contained pages, it will still exist. It means that the user can be asked not to move a category to an... empty category. The easy answer here is "just check for category membership, and allow the move when no categorylinks entries point to the target category"; however, how will we deal with CategoryLinksUpdate Jobs that might be pending ? In other words, the target category might appear empty at the query time, when a move / redirect change is undergo. Should I try to address that ? If on moves, I make the job_title the target category title, searching the job table for job_cmd=categoryLinksUpdate AND job_title=target could help, but how does it sound ?
If the database query is that simple, and appropriate indexes are available, it shouldn't be a problem to use that as a criterion. Don't rely on cat_count, by the way, to determine whether the category is empty. SELECT 1 FROM categorylinks WHERE ... LIMIT 1 is just as fast and considerably more accurate (if combined with the job queue check). Of course you could always have race conditions of various sorts unless you take out nasty locks, so hopefully the world won't fall apart if you move over an existing category somehow by mistake.
wikitech-l@lists.wikimedia.org