Hello, wikitech.
I have applied to Google Summer of Code with a project to enable category moving without using bots. After some correspondence with Catrope, the following text is my project idea. Any feedback would be welcome.
Synopsis
I will provide the capability to move categories, giving end users an effect similar to that of moving other pages. Currently, contributors must request a bot that recreates the category page and changes the category link on all relevant articles.
Project
The project can be divided into three parts. First, the category page is moved, along with its history, just as when articles are renamed. A redirect is optionally placed on the old category page, and the category discussion is moved as well.
Second, all articles in the relevant category must have their category links changed. There are several obstacles involved in this task:
1. Finding all alternative ways of categorizing articles. It is simple to match plain category links and category lists, but more difficult to find, e.g., categories included from a template. Roan Kattouw (Catrope) suggested category redirects for this, such that all articles categorised as [[Category:A]] would also be listed at [[Category:B]] if the former has been redirected to the latter.
2. Articles might be in the process of being edited while the move is performed. This, however, can be solved in the same manner as edit collisions are currently solved.
3. The algorithm would likely have high complexity and would thus not scale well with very large categories. This is likely to constitute a significant and challenging part of the project.
As the last step, the relevant entries in the categorylinks table would need to be changed. This is accomplished by a simple SQL query. This could be avoided if bug #13579 [1] ("Category table should use category ID rather than category name") is fixed, which it could be as part of this project.
The project would preferably be written as a patch to the core. Catrope suggested setting up a separate SVN branch for the project, such that everyone can see my progress.
Benefits for MediaWiki
Developing a means of moving categories would decrease dependency on bots, saving administrative time. Additionally, the solution would be faster than any bot-based solution could be because, among other things, it would not need to load pages.
Category moving would also increase the consistency of layout across the different page types. The only real reason for a "move" tab not to appear on category pages is that the feature is not yet implemented.
Roadmap
Publishing this document to the MediaWiki development community (wikitech-l) and awaiting comments on the planned procedure would be the first step.
After the community bonding period specified by the timeline, a week should be enough to get comfortable with the relevant MediaWiki code and implement the first part: moving the category page along with its discussion and history. Much existing code should be reusable here, such as the Title::moveTo() method for moving pages.
By mid-July, most of the second part of the project should be finished. A week after that, the last part would be completed as well. A month is then reserved for bug testing, tweaking, and as a buffer for unexpected obstacles. The MediaWiki community is very important in this step for testing and feedback.
Regards
ewww Tim.
I didn't know you were planning to apply for this work; I thought you were working on something along the lines of bug #167.
It happens that I was also planning to work on that category renaming feature, even if only brion, Simetrical and VasilievVV knew about it.
My application / project idea description is here: http://en.wikipedia.org/wiki/User:NicDumZ/GSoC_2008
On 02/04/2008, Nicolas Dumazet nicdumz@gmail.com wrote:
> ewww Tim.
> I didn't know you were planning to apply for this work; I thought you were working on something along the lines of bug #167.
> It happens that I was also planning to work on that category renaming feature, even if only brion, Simetrical and VasilievVV knew about it.
> My application / project idea description is here: http://en.wikipedia.org/wiki/User:NicDumZ/GSoC_2008
> --
> Nicolas Dumazet (NicDumZ)
> Second year, ENSIMAG.
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
That is unfortunate. Seems as though we started around the same time. According to the GSoC FAQ [1] though, "That's fine, a little duplication is par for the course in open source."
[1] http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_multiple_accepted
2008/4/2, Tim Johansson gurktim@gmail.com:
> That is unfortunate. Seems as though we started around the same time. According to the GSoC FAQ [1] though, "That's fine, a little duplication is par for the course in open source."
> [1] http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_multiple_accepted
Yes, I read that too.
However your implementation proposal is far better than mine, obviously. I was indeed focused on modifying wikitext, which does not make much sense when supporting category redirects will work just the same.
Even having category redirects work properly (so that if [[Category:Foo]] redirects to [[Category:Bar]], putting an article in [[Category:Foo]] means it shows up in [[Category:Bar]] - much like redirected templates work) would be most helpful in allowing Commons to use languages other than English for its category tree - so that different-language names for the same thing would work the same, e.g. [[Category:Horse]], [[Category:Cheval]] and [[Category:Hauspferd]] could all point to [[Category:Equus caballus]] and Just Work.
- d.
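
The redirect behaviour described above can be sketched in a few lines. This is only an illustration, not MediaWiki code: the dictionaries `redirects` and `links` and the functions `resolve_category` and `members_of` are hypothetical names, and the page titles are made up.

```python
from typing import Dict, List, Set


def resolve_category(name: str, redirects: Dict[str, str]) -> str:
    """Follow redirects from `name` until a non-redirect category is reached."""
    seen: Set[str] = set()
    while name in redirects:
        if name in seen:  # guard against redirect loops
            raise ValueError(f"redirect loop at {name!r}")
        seen.add(name)
        name = redirects[name]
    return name


def members_of(category: str,
               links: Dict[str, List[str]],
               redirects: Dict[str, str]) -> List[str]:
    """List pages whose category tags resolve to `category`."""
    return sorted(
        page
        for page, tags in links.items()
        if any(resolve_category(tag, redirects) == category for tag in tags)
    )


# Hypothetical data: three language-specific names redirect to one category.
redirects = {
    "Horse": "Equus caballus",
    "Cheval": "Equus caballus",
    "Hauspferd": "Equus caballus",
}
links = {
    "Photo 1": ["Horse"],
    "Photo 2": ["Cheval"],
    "Photo 3": ["Equus caballus"],
    "Photo 4": ["Cat"],
}

horse_pages = members_of("Equus caballus", links, redirects)
```

The wikitext of each page (here, the `links` values) is never modified; only the lookup follows the redirect.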
On Tue, Apr 1, 2008 at 5:39 PM, Tim Johansson gurktim@gmail.com wrote:
> Second, all articles in the relevant category must have their category links changed. There are several obstacles involved in this task:
> - Finding all alternative ways of categorizing articles. It is simple to match plain category links and category lists, but more difficult to find, e.g., categories included from a template. Roan Kattouw (Catrope) suggested category redirects for this, such that all articles categorised as [[Category:A]] would also be listed at [[Category:B]] if the former has been redirected to the latter.
This is the way to do it. When we move an article currently, we don't try to change the links to that article in all pages' wikitext, do we? It would be hopeless. We rewrite the target at the point where the link is followed, not where it's created.
> - Articles might be in the process of being edited while the move is performed. This, however, can be solved in the same manner as edit collisions are currently solved.
I don't think it's necessary to worry about this. The wikitext of the categorized pages should be unaffected, so there's nothing to resolve. The analogy of ordinary links is helpful here.
> - The algorithm would likely have high complexity and would thus not scale well with very large categories. This is likely to constitute a significant and challenging part of the project.
One implementation for this would be
1) Change cl_to to two columns: cl_to_id and cl_final_id. cl_to_id would contain the id of the category that it's actually included in, whereas cl_final_id would be the id of the category it's included in once all redirects are resolved.
2) When querying what category something is in for the purposes of category pages, etc., use cl_final_id, not cl_to_id.
3) When moving a category, change nothing in the categorylinks table; the same cat_id will just refer to a new name. Create a redirect as usual.
4) When changing an existing redirect (e.g., deleting it), or changing an existing category into a redirect, just do UPDATE categorylinks SET cl_final_id=$newdestination WHERE cl_to_id=$changedcat. This part will be slow for large categories, perhaps unacceptably so for very large ones. This is comparable to deleting large pages at present and may need to be treated similarly.
This implementation is not normalized, which is why it's slow for changing redirects. We could just use the same technique we use for pages: join to the redirect table on every select. The problem is that this doesn't work so well for necessities like sorting, as far as I can see. You have to be able to sort efficiently when doing retrieval for category pages. I'm a little tired right now, but I can't see offhand how to do this in a way that's efficient for both updating and selecting, you're right.
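
A minimal sketch of steps 1) to 4), using an in-memory SQLite database as a stand-in for the real schema. The `category` table layout, the helper names `members`, `move_category` and `set_redirect`, and the sample rows are assumptions for illustration; only `cl_to_id`, `cl_final_id` and the UPDATE statement come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (cat_id INTEGER PRIMARY KEY, cat_title TEXT UNIQUE);
CREATE TABLE categorylinks (
    cl_from INTEGER,      -- id of the categorized page
    cl_to_id INTEGER,     -- category named in the page's wikitext
    cl_final_id INTEGER   -- category after resolving redirects
);
INSERT INTO category VALUES (1, 'Foo'), (2, 'Bar');
INSERT INTO categorylinks VALUES (10, 1, 1), (11, 1, 1), (12, 2, 2);
""")


def members(cat_id):
    """Step 2: category pages read cl_final_id, not cl_to_id."""
    rows = conn.execute(
        "SELECT cl_from FROM categorylinks WHERE cl_final_id = ? ORDER BY cl_from",
        (cat_id,),
    )
    return [row[0] for row in rows]


def move_category(cat_id, new_title):
    """Step 3: a move only renames the category row; categorylinks is untouched."""
    conn.execute(
        "UPDATE category SET cat_title = ? WHERE cat_id = ?", (new_title, cat_id)
    )


def set_redirect(changed_cat, new_destination):
    """Step 4: retarget every link that names the changed category."""
    conn.execute(
        "UPDATE categorylinks SET cl_final_id = ? WHERE cl_to_id = ?",
        (new_destination, changed_cat),
    )


move_category(1, "Baz")
after_move = members(1)

set_redirect(changed_cat=1, new_destination=2)
after_redirect = members(2)
```

The move touches one row regardless of category size, while `set_redirect` touches one row per member, which is where the cost for large categories shows up. Chains of redirects are not handled in this sketch.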
> As the last step, the relevant entries in the categorylinks table would need to be changed. This is accomplished by a simple SQL query. This could be avoided if bug #13579 [1] ("Category table should use category ID rather than category name") is fixed, which it could be as part of this project.
Well, the simple SQL query could turn out to be a problem for very large categories. I might be wrong; a single update may well run faster than the insert/delete we have right now for large page deletions.
> The project would preferably be written as a patch to the core. Catrope suggested setting up a separate SVN branch for the project, such that everyone can see my progress.
Yes, certainly.
> After the community bonding period
:)
On Tue, Apr 1, 2008 at 6:07 PM, David Gerard dgerard@gmail.com wrote:
> Even having category redirects work properly (so that if [[Category:Foo]] redirects to [[Category:Bar]], putting an article in [[Category:Foo]] means it shows up in [[Category:Bar]] - much like redirected templates work) would be most helpful in allowing Commons to use languages other than English for its category tree - so that different-language names for the same thing would work the same, e.g. [[Category:Horse]], [[Category:Cheval]] and [[Category:Hauspferd]] could all point to [[Category:Equus caballus]] and Just Work.
The redirects seem like the hard part here. Once those are in place, moving should be pretty easy. It practically just automates what users could easily do anyway by copying over the page content and adding a redirect manually.
Simetrical wrote:
> The redirects seem like the hard part here. Once those are in place, moving should be pretty easy. It practically just automates what users could easily do anyway by copying over the page content and adding a redirect manually.
I'd even say it's the only part of category renaming. After we handle category redirects, we will only have to change Namespace.php --VasilievVV
On Wed, Apr 2, 2008 at 2:41 AM, Simetrical Simetrical+wikilist@gmail.com wrote:
> - Change cl_to to two columns: cl_to_id and cl_final_id. cl_to_id would contain the id of the category that it's actually included in, whereas cl_final_id would be the id of the category it's included in once all redirects are resolved.
> - When querying what category something is in for the purposes of category pages, etc., use cl_final_id, not cl_to_id.
Wouldn't it be easier for upgrading and backwards compatibility to keep the current cl_to field, which would hold the category named in the wikitext, and add a cl_id field, which would hold the real category being pointed to?
Bryan
Bryan Tong Minh wrote:
> On Wed, Apr 2, 2008 at 2:41 AM, Simetrical Simetrical+wikilist@gmail.com wrote:
> > - Change cl_to to two columns: cl_to_id and cl_final_id. cl_to_id would contain the id of the category that it's actually included in, whereas cl_final_id would be the id of the category it's included in once all redirects are resolved.
> > - When querying what category something is in for the purposes of category pages, etc., use cl_final_id, not cl_to_id.
> Wouldn't it be easier for upgrading and backwards compatibility to keep the current cl_to field, which would hold the category named in the wikitext, and add a cl_id field, which would hold the real category being pointed to?
That's probably a good idea.
Simetrical wrote:
> Well, the simple SQL query could turn out to be a problem for very large categories. I might be wrong; a single update may well run faster than the insert/delete we have right now for large page deletions.
That's why I suggested using the category table rather than changing lots of rows in categorylinks.
> - When changing an existing redirect (e.g., deleting it), or changing an existing category into a redirect, just do UPDATE categorylinks SET cl_final_id=$newdestination WHERE cl_to_id=$changedcat. This part will be slow for large categories, perhaps unacceptably so for very large ones. This is comparable to deleting large pages at present and may need to be treated similarly.
Yes, at least something will suck here. I think your suggestion is preferable (making changing popular category redirects suck rather than making moving large categories suck), but maybe we could use the job queue here rather than a huge UPDATE query.
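
One way to picture the job-queue idea is to split the single large UPDATE into small chunks that can each run as a separate job. The sketch below uses SQLite and a plain loop as a stand-in for a real job queue; the function name `retarget_in_batches`, the batch size, and the sample data are all assumptions.

```python
import sqlite3


def retarget_in_batches(conn, changed_cat, new_destination, batch_size=1000):
    """Apply the retargeting UPDATE in small chunks; return the number of chunks run."""
    batches = 0
    while True:
        cur = conn.execute(
            """UPDATE categorylinks SET cl_final_id = ?
               WHERE rowid IN (
                   SELECT rowid FROM categorylinks
                   WHERE cl_to_id = ? AND cl_final_id <> ?
                   LIMIT ?)""",
            (new_destination, changed_cat, new_destination, batch_size),
        )
        conn.commit()  # each chunk is its own short transaction
        if cur.rowcount == 0:  # nothing left to retarget
            return batches
        batches += 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorylinks (cl_from INTEGER, cl_to_id INTEGER, cl_final_id INTEGER)"
)
# 2500 hypothetical pages, all tagged with category 1.
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, 1, 1)", [(i,) for i in range(2500)]
)

batches = retarget_in_batches(conn, changed_cat=1, new_destination=2)
```

While the chunks are running, the category is in a mixed state, with some members already listed under the new destination and some not; whether that is acceptable is a design question this sketch does not answer.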
There is one thing nobody mentioned yet: nonexistent categories can have members, so it's possible to move one category on top of another one. For example, let [[Category:A]] be an existent category and [[Category:B]] a nonexistent one that does have members. If [[Category:A]] is then moved to [[Category:B]] (which is allowed, since the target doesn't exist), the categories would have to be merged. The thing is that A and B had different category IDs before the move, but the merged category will only have one ID after the move. This again means updating category IDs in the categorylinks table. We could probably use row count estimates here to decide which ID the unified category gets (A's or B's, depending on which one would result in more rows being changed) and stuff the UPDATEs in the job queue if both estimates are unacceptably large.
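
A rough sketch of the merge case, under two assumptions: exact COUNT(*) queries stand in for the row count estimates, and the surviving ID is the one with more member rows, so that fewer rows have to be rewritten. The column name `cl_id` follows Bryan's suggestion; `merge_categories` and the sample data are hypothetical.

```python
import sqlite3


def merge_categories(conn, id_a, id_b):
    """Merge two category IDs, keeping the one with more member rows.

    Returns (kept_id, number_of_rows_rewritten).
    """
    def count(cat_id):
        return conn.execute(
            "SELECT COUNT(*) FROM categorylinks WHERE cl_id = ?", (cat_id,)
        ).fetchone()[0]

    # Keep the larger side so the UPDATE touches the smaller one.
    keep, drop = (id_a, id_b) if count(id_a) >= count(id_b) else (id_b, id_a)
    cur = conn.execute(
        "UPDATE categorylinks SET cl_id = ? WHERE cl_id = ?", (keep, drop)
    )
    return keep, cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_id INTEGER)")
# Category A (id 1) has 2 members; nonexistent category B (id 2) has 5.
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 2), (7, 2)],
)

kept_id, rows_changed = merge_categories(conn, id_a=1, id_b=2)
```

If both counts were too large, the UPDATE would be handed to the job queue instead of running inline; that branch is left out here.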
Roan Kattouw (Catrope)
An update on this: Tim Johansson is no longer working on this project; I am doing it for GSoC :)
First, I notice that all the currently proposed move/redirect solutions will always display the old link at the bottom of the article. I first saw category moves as a great opportunity to perform in core the category moves that bot owners do every day on WM projects, most of the time to comply with a naming convention; however, these current solutions do not allow this. Sure, clicking on the category link would redirect the user to the page with the standardized name, but we might want to follow the category redirect directly for display instead, don't you think? (And this might be a $wgResolveCategoryRedirects global boolean.)
Also, if we decide to turn cl_to into a cat_id, it will require us, when inserting a new category into an article, to fetch the cat_id corresponding to the title we got from the wikitext in order to update categorylinks.
> - Change cl_to to two columns: cl_to_id and cl_final_id. cl_to_id would contain the id of the category that it's actually included in, whereas cl_final_id would be the id of the category it's included in once all redirects are resolved.
> - When querying what category something is in for the purposes of category pages, etc., use cl_final_id, not cl_to_id.
> - When changing an existing redirect (e.g., deleting it), or changing an existing category into a redirect, just do UPDATE categorylinks SET cl_final_id=$newdestination WHERE cl_to_id=$changedcat. This part will be slow for large categories, perhaps unacceptably so for very large ones. This is comparable to deleting large pages at present and may need to be treated similarly.
I'm wondering why you need that cl_to_id. To show the closest membership of a category? (If Category:A redirects to Category:B, to be able to display on title=Category:A&redirect=no the pages that directly belong to A?) Do we need this? I would say that knowing the final destination is enough?!
And here again we have the "problem" of finding what cl_final_id is, knowing a title. We do need to fetch a row in the page table to know if this is a redirect. And if it is, a join between redirect and category is needed to get the cat_id.
I am thinking of a table containing, for each category page title, the cat_id it refers to. When updating a page, a query joining that table and the category table could retrieve A) the cat_id needed to update categorylinks and B) possibly the cat_title to display the "proper" links at the bottom of the article.
To clarify my point a bit: I'm saying that when Category:A redirects to Category:B, articles belonging to A and B should point to the same cat_id.
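
A small sketch of such a lookup table, again with SQLite standing in for the real database. The table name `category_alias`, its columns `ca_title` and `ca_cat_id`, and the function `lookup` are hypothetical; the thread does not name them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (cat_id INTEGER PRIMARY KEY, cat_title TEXT);
-- Hypothetical table: one row per category page title, redirect or not.
CREATE TABLE category_alias (ca_title TEXT PRIMARY KEY, ca_cat_id INTEGER);
INSERT INTO category VALUES (1, 'B');
-- Category:A redirects to Category:B, so both titles map to cat_id 1.
INSERT INTO category_alias VALUES ('A', 1), ('B', 1);
""")


def lookup(title):
    """Return (cat_id, canonical cat_title) for a title found in wikitext, or None."""
    return conn.execute(
        """SELECT c.cat_id, c.cat_title
           FROM category_alias AS a
           JOIN category AS c ON c.cat_id = a.ca_cat_id
           WHERE a.ca_title = ?""",
        (title,),
    ).fetchone()


from_a = lookup("A")
from_b = lookup("B")
```

A single join gives both the ID to store in categorylinks and the title to show at the bottom of the article, without first checking the page table for a redirect.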
Nicolas Dumazet wrote:
> I am thinking of a table containing, for each category page title, the cat_id it refers to. When updating a page, a query joining that table and the category table could retrieve A) the cat_id needed to update categorylinks and B) possibly the cat_title to display the "proper" links at the bottom of the article.
> To clarify my point a bit: I'm saying that when Category:A redirects to Category:B, articles belonging to A and B should point to the same cat_id.
Not sure if I'm understanding you, but are you taking into account that after merging categories A and B (by way of making a redirect from A to B) someone may revert/split it by making A a different category? I think that could be the reasoning for cl_to_id. And that use case is tricky enough to break many easy solutions :)
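
To illustrate why keeping cl_to_id makes the split possible, here is a sketch in which a redirect from A to B is applied and then undone. SQLite is a stand-in for the real database, and the helper names `redirect`, `unredirect` and `members` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categorylinks (cl_from INTEGER, cl_to_id INTEGER, cl_final_id INTEGER);
-- Page 10 is tagged with category A (id 1), page 11 with category B (id 2).
INSERT INTO categorylinks VALUES (10, 1, 1), (11, 2, 2);
""")


def members(cat_id):
    rows = conn.execute(
        "SELECT cl_from FROM categorylinks WHERE cl_final_id = ? ORDER BY cl_from",
        (cat_id,),
    )
    return [row[0] for row in rows]


def redirect(src, dst):
    """Merge: everything tagged `src` is now listed under `dst`."""
    conn.execute(
        "UPDATE categorylinks SET cl_final_id = ? WHERE cl_to_id = ?", (dst, src)
    )


def unredirect(src):
    """Split: cl_to_id still records the original tag, so it can be restored."""
    conn.execute(
        "UPDATE categorylinks SET cl_final_id = cl_to_id WHERE cl_to_id = ?", (src,)
    )


redirect(1, 2)
merged = members(2)

unredirect(1)
split_a, split_b = members(1), members(2)
```

With only a final ID stored, pages 10 and 11 would be indistinguishable after the merge, and the split could not be computed from the table alone.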
2008/6/29 Platonides wrote:
> Not sure if I'm understanding you, but are you taking into account that after merging categories A and B (by way of making a redirect from A to B) someone may revert/split it by making A a different category? I think that could be the reasoning for cl_to_id. And that use case is tricky enough to break many easy solutions :)
No, that's it. I was merely assuming that a category move would be irreversible, because that is how it is done at the moment: when a bot moves all articles from Cat A to Cat B, there's no easy way to undo that move.
On Sun, Jun 29, 2008 at 5:29 AM, Nicolas Dumazet nicdumz@gmail.com wrote:
> No, that's it. I was merely assuming that a category move would be irreversible, because that is how it is done at the moment: when a bot moves all articles from Cat A to Cat B, there's no easy way to undo that move.
That's not really an acceptable state of affairs, though. It would cause havoc if someone merged two big categories irreversibly. This is true for bots, too, but bots are easily blocked before they get very far, given rate limits, unless they're specifically trusted enough to be flagged.