Schema change : category redirects - Wikitech-l

1 Jul 2008


      Hello !
Following the latest thread on this [0] (and the older [1]) about
implementation details for my GsoC project, aiming at adding category
redirects / moves, I would need more input on the DB schema change
that was being discussed, knowing that :
* We expect category moves to be reversible
* The only "feature" we are in fact adding is category redirects
* I am considering this schema change as a good opportunity to fix bug
#13579 [Categorylinks table should use category ID rather than
category name in cl_to field] (please comment overthere if you see
specific issues for that change)
While thinking about that schema change, I considered the following
use cases, considering categories A B and C :
1)Move existing A to empty B.
2)A and B contain pages  : Redirect A to B.
3)A redirects to B : make A redirect to C
4)A redirects to B : undo the redirect, make A and B plain categories
5)A redirects to B : invert the redirect, make B redirect to A
6)On page edits, alter the category_links table
7)Listing pages that belong to a specific category
Use case #1 is fairly easy to implement if cl_to points to a cat_id :
you only have to rename the cat_title of the corresponding category
table row, leaving the cat_id unchanged, and that's in fact the reason
for switching cl_to' type
== Original idea : add a cl_final field ==
The original idea, from the previous threads, was to change the
category_links table : instead of only having the single cl_to
pointing to the cat_id of the included category, add a cl_final
integer field, also pointing to a cat_id, but this time the cat_id of
the final category (see [2] ) : in other words, when category A
redirects to category B, if a page includes category A, its
category_links row' cl_to will point to category A' cat_id, and its
cl_final will point to category B' cat_id.
* Use cases #2, #3, and #4 :
Expensive when A is a big category !  For each category_links row
where cl_to = catA_cat_id, update cl_final, to respectively, B'
cat_id, C' cat_id, and A' cat_id.
* Use case #5 :
Very expensive, for large A and B: you have to update every A and B
category_links.
* Use case #6 :
When adding a category to a page, you have to fetch its corresponding
cat_id for cl_to, (fairly easy), but you also have to fetch the right
cl_final.
You have to know first, if the category is a redirect. And if I'm
right, with that schema, the only ways to tell this actually, are to
retrieve the corresponding page_is_redirect in the page table, or to
check for an entry in the redirect table.
I believe that this forces us to compute a page-redirect join on
page_id + a redirect-category join on page_title for each category
title (see [3] ). If there's no results for that query, (can be caused
if {1} the category page does not exist, {2} the category page exist
but is not a redirect), the category is not a redirect, and else it
returns us the cat_id for cl_final.
* Use case #7 :
Easy. SELECT ... FROM category_links WHERE cl_final = ##
== But what about adding a cat_final field instead ? ==
When A redirects to B, all A' category_links entries will share the
same cl_final field. Being quite unexperienced, and very naive, I may
miss something important here... but I think that it makes much more
sense to add that shared value to the category table, instead of
duplicating it times the number of pages included in the category.
I'm saying that instead of adding a cl_final field to the
category_links table, we should perhaps add a cat_final field to the
category table pointing to the final category it belongs to (see [4]
).
* Use cases #2 #3 and #4 :
Trivial. Change  cat_final in one category table row.
* Use case #5 :
Damn. Tricky. Alter TWO category table rows :p
* Use case #6 :
Easy, fetch the corresponding cat_id to fill cl_to.
* Use case #7 :
The most expensive operation for this proposal. You have to join
category_links and category (hopefully on cat_id) to retrieve what was
cl_final in the first proposition, and select ... where cat_final =
###
Evaluating the cost of a query is clearly one of the hardest things to
do for that young me, and I fail to estimate the cost of that last
query. How expensive is it compared to its use frequency ? Being
apparently the most expensive part for this proposed change, the
answer of that question will probably state how valid was my second
approach ...
I need your help to finalize the schema change needed to implement
category redirects. I may also miss a third solution, even better, or
have forgiven some important details while considering what has to be
done : let me know :)
[0] http://lists.wikimedia.org/pipermail/wikitech-l/2008-June/038495.html
[1] http://lists.wikimedia.org/pipermail/wikitech-l/2008-April/thread.html#37218
[2] http://commons.wikimedia.org/wiki/Image:Mediawiki_schema_change_for_category...
[3] http://commons.wikimedia.org/wiki/Image:Mediawiki_use_case_category_redirect...
[4] http://commons.wikimedia.org/wiki/Image:Mediawiki_schema_change_for_category...
-- 
Nicolas Dumazet — NicDumZ [ nIk.d̪ymz ]