Hello !
Following the latest thread on this [0] (and the older [1]) about implementation details for my GsoC project, aiming at adding category redirects / moves, I would need more input on the DB schema change that was being discussed, knowing that :
* We expect category moves to be reversible * The only "feature" we are in fact adding is category redirects * I am considering this schema change as a good opportunity to fix bug #13579 [Categorylinks table should use category ID rather than category name in cl_to field] (please comment overthere if you see specific issues for that change)
While thinking about that schema change, I considered the following use cases, considering categories A B and C : 1)Move existing A to empty B. 2)A and B contain pages : Redirect A to B. 3)A redirects to B : make A redirect to C 4)A redirects to B : undo the redirect, make A and B plain categories 5)A redirects to B : invert the redirect, make B redirect to A 6)On page edits, alter the category_links table 7)Listing pages that belong to a specific category
Use case #1 is fairly easy to implement if cl_to points to a cat_id : you only have to rename the cat_title of the corresponding category table row, leaving the cat_id unchanged, and that's in fact the reason for switching cl_to' type
== Original idea : add a cl_final field ==
The original idea, from the previous threads, was to change the category_links table : instead of only having the single cl_to pointing to the cat_id of the included category, add a cl_final integer field, also pointing to a cat_id, but this time the cat_id of the final category (see [2] ) : in other words, when category A redirects to category B, if a page includes category A, its category_links row' cl_to will point to category A' cat_id, and its cl_final will point to category B' cat_id.
* Use cases #2, #3, and #4 : Expensive when A is a big category ! For each category_links row where cl_to = catA_cat_id, update cl_final, to respectively, B' cat_id, C' cat_id, and A' cat_id.
* Use case #5 : Very expensive, for large A and B: you have to update every A and B category_links.
* Use case #6 : When adding a category to a page, you have to fetch its corresponding cat_id for cl_to, (fairly easy), but you also have to fetch the right cl_final. You have to know first, if the category is a redirect. And if I'm right, with that schema, the only ways to tell this actually, are to retrieve the corresponding page_is_redirect in the page table, or to check for an entry in the redirect table. I believe that this forces us to compute a page-redirect join on page_id + a redirect-category join on page_title for each category title (see [3] ). If there's no results for that query, (can be caused if {1} the category page does not exist, {2} the category page exist but is not a redirect), the category is not a redirect, and else it returns us the cat_id for cl_final.
* Use case #7 : Easy. SELECT ... FROM category_links WHERE cl_final = ##
== But what about adding a cat_final field instead ? ==
When A redirects to B, all A' category_links entries will share the same cl_final field. Being quite unexperienced, and very naive, I may miss something important here... but I think that it makes much more sense to add that shared value to the category table, instead of duplicating it times the number of pages included in the category.
I'm saying that instead of adding a cl_final field to the category_links table, we should perhaps add a cat_final field to the category table pointing to the final category it belongs to (see [4] ).
* Use cases #2 #3 and #4 : Trivial. Change cat_final in one category table row.
* Use case #5 : Damn. Tricky. Alter TWO category table rows :p
* Use case #6 : Easy, fetch the corresponding cat_id to fill cl_to.
* Use case #7 : The most expensive operation for this proposal. You have to join category_links and category (hopefully on cat_id) to retrieve what was cl_final in the first proposition, and select ... where cat_final = ###
Evaluating the cost of a query is clearly one of the hardest things to do for that young me, and I fail to estimate the cost of that last query. How expensive is it compared to its use frequency ? Being apparently the most expensive part for this proposed change, the answer of that question will probably state how valid was my second approach ...
I need your help to finalize the schema change needed to implement category redirects. I may also miss a third solution, even better, or have forgiven some important details while considering what has to be done : let me know :)
[0] http://lists.wikimedia.org/pipermail/wikitech-l/2008-June/038495.html [1] http://lists.wikimedia.org/pipermail/wikitech-l/2008-April/thread.html#37218 [2] http://commons.wikimedia.org/wiki/Image:Mediawiki_schema_change_for_category... [3] http://commons.wikimedia.org/wiki/Image:Mediawiki_use_case_category_redirect... [4] http://commons.wikimedia.org/wiki/Image:Mediawiki_schema_change_for_category...