Very glad to hear there is progress on the category intersection front. A few comments:
1) I hope there are some broad discussions about how to integrate this into Wikipedia and other projects. Several of us (mostly admins who focus on categorization) have been discussing this on English Wikipedia at [[Wikipedia:Category intersection]]. You can also get there with the shortcut [[WP:CI]]. If we could designate one place to discuss user interface ideas and designs, that would be helpful.
2) If implementing the intersection of more than 2 categories involves nesting, it might be fastest if you start with the smallest categories and continue to process with progressively larger categories. The intersections of the small categories will result in a small result, and each pass the result is at most the same number of members and likely much smaller, so the next intersection might be much faster. If this is the case, then it would be possible to speed up the servers by requiring at least one of the intersected categories to be below some set maximum number of members. Users could get a message that all of the categories selected were too large. The worst case seems to be the intersection of only huge categories with very few or no members in common. If you add a small category to the mix it should be much faster as long as you start with the small one.
3) If the table structure of categories is going to be redesigned, could the same or similar structure be used for links? This way we could also implement link intersections with the same code, and use the wiki-links that already exist on pages as tags. See [[Wikipedia:Link intersection]] (shortcut [[WP:LI]]) for more about this.
-- Samuel Wantman [[en:User:Sam]]
On Fri, Mar 7, 2008 at 9:10 AM, Samuel Wantman wantman@earthlink.net wrote:
Very glad to hear there is progress on the category intersection front.
And I welcome any help :-)
A few comments:
- I hope there are some broad discussions about how to integrate this into Wikipedia and other projects. Several of us (mostly admins who focus on categorization) have been discussing this on English Wikipedia at [[Wikipedia:Category intersection]]. You can also get there with the shortcut [[WP:CI]]. If we could designate one place to discuss user interface ideas and designs, that would be helpful.
ATM, I only have a special page with a textbox where one can enter a list of categories, click a button, and get pages in this intersection. Works, but is ugly. No problem adding nicer interfaces, though; the main thing is the algorithm beneath it.
- If implementing the intersection of more than 2 categories involves nesting, it might be fastest if you start with the smallest categories and continue to process with progressively larger categories. The intersections of the small categories will result in a small result, and each pass the result is at most the same number of members and likely much smaller, so the next intersection might be much faster. If this is the case, then it would be possible to speed up the servers by requiring at least one of the intersected categories to be below some set maximum number of members. Users could get a message that all of the categories selected were too large. The worst case seems to be the intersection of only huge categories with very few or no members in common. If you add a small category to the mix it should be much faster as long as you start with the small one.
And I would have done exactly that, except there's no quick way to get the size of a category. There was a discussion on this list about storing the size of a category in a special field recently.
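To make the smallest-first idea concrete, here is a minimal sketch. It assumes each category is already available as an in-memory Python set of page titles, so its size is free to read; that is exactly the part that is not cheap in the database today. The function names and the `max_smallest` threshold are hypothetical, not anything that exists in MediaWiki.

```python
def intersect_smallest_first(categories):
    """Intersect sets of page titles, processing the smallest set first.

    `categories` is a list of sets. The running result can only shrink,
    so each later pass works on at most as many members as the one before.
    """
    if not categories:
        return set()
    ordered = sorted(categories, key=len)   # smallest category first
    result = set(ordered[0])                # copy, so the input is not modified
    for members in ordered[1:]:
        result &= members
        if not result:                      # nothing left to match, stop early
            break
    return result


def intersect_with_size_limit(categories, max_smallest):
    """Refuse the query when even the smallest category exceeds the limit."""
    if categories and min(len(c) for c in categories) > max_smallest:
        raise ValueError("all selected categories are too large")
    return intersect_smallest_first(categories)
```

With a stored per-category size, the same ordering could be chosen before any member rows are read.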
- If the table structure of categories is going to be redesigned, could the same or similar structure be used for links? This way we could also implement link intersections with the same code, and use the wiki-links that already exist on pages as tags. See [[Wikipedia:Link intersection]] (shortcut [[WP:LI]]) for more about this.
I'm not redesigning the categories table, I'm adding a new one for the pre-calculated intersection hashes. My approach is based on pre-calculating intersections of every combination of categories used in an article. That, in turn, will only work if the average number of categories per article is low.
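A rough sketch of what such a pre-calculated table could look like, in Python rather than SQL. Everything specific here is an assumption made for illustration: MD5 over the sorted, "|"-joined category names as the hash, combinations of every size from two upward, and a dict standing in for the new table. The message above does not say which hash is used or whether combinations are capped in size.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def combo_hash(categories):
    """Hash a category combination independently of the order given."""
    key = "|".join(sorted(categories))      # assumed separator
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def build_index(page_categories):
    """Map hash -> set of pages, for every combination of 2+ categories
    that occurs together on some page."""
    index = defaultdict(set)
    for page, cats in page_categories.items():
        unique = sorted(set(cats))
        for size in range(2, len(unique) + 1):
            for combo in combinations(unique, size):
                index[combo_hash(combo)].add(page)
    return index


def lookup(index, categories):
    """Answer an intersection query with a single hash lookup."""
    return index.get(combo_hash(categories), set())


def rows_per_page(n_categories):
    """Rows one page contributes: all subsets minus the empty set and singletons."""
    return 2 ** n_categories - n_categories - 1
```

Under this all-subsets reading, a page with n categories contributes 2^n - n - 1 rows, which is why the approach depends on the number of categories per article staying low.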
We have lots of links, which would result in hundreds or thousands of rows per article. That pretty much invalidates my approach for that purpose, for practical reasons (space and lookup time).
Reading WP:LI, the "similar pages" could be implemented another way:
1. For page A, choose the links you're interested in (e.g., B, C, D)
2. Get "what links here" for B, C, D
3. List all pages that show up three times
Should not be hard. I'll try to make a toolserver thing for that.
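The three steps above can be sketched as follows. The `backlinks` mapping is a stand-in for "what links here" results and the function name is hypothetical; a real tool would fetch those lists from the links table or the API instead.

```python
from collections import Counter


def pages_linking_to_all(backlinks, targets, exclude=None):
    """Return the pages that appear in the backlink list of every target.

    `backlinks` maps a page title to the titles that link to it.
    `exclude` is typically the starting page A, which links to all of
    its own links by construction.
    """
    targets = list(dict.fromkeys(targets))       # drop duplicate targets
    counts = Counter()
    for target in targets:
        # set() so a page is counted at most once per target
        counts.update(set(backlinks.get(target, ())))
    hits = {page for page, n in counts.items() if n == len(targets)}
    hits.discard(exclude)
    return hits
```

For the example in the steps, `pages_linking_to_all(backlinks, ["B", "C", "D"], exclude="A")` keeps only pages that show up three times.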
Cheers, Magnus
On Fri, Mar 7, 2008 at 4:10 AM, Samuel Wantman wantman@earthlink.net wrote:
- I hope there are some broad discussions about how to integrate this into Wikipedia and other projects. Several of us (mostly admins who focus on categorization) have been discussing this on English Wikipedia at [[Wikipedia:Category intersection]]. You can also get there with the shortcut [[WP:CI]]. If we could designate one place to discuss user interface ideas and designs, that would be helpful.
If there is, other than Wikitech-l, it should not be muddied with policy questions as well, as the page you link to is. I am not interested in how Wikipedia is going to categorize things.
The interface I had in mind was basically a box on every category page, saying something to the effect of
Enter a category to require: [____________________________]
Enter a category to exclude: [____________________________]
( Submit )
Although the wording could use improvement. Filling one or both of the fields and clicking Submit would jump to a special page that would do the intersection, have a box to add/remove another category (or many of them at once). It would also list all categories currently represented, with little (X) links next to them to remove them from the result set.
This would be the basic functionality. Additional stuff like a Suggestions button (to give a partial list of categories with nonzero intersections with the present category, preferably the largest ones) would be valuable and perhaps feasible additions, but there's no gain in trying to do too much at once.
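The require/exclude box described above boils down to a set intersection followed by a set difference. A minimal sketch, assuming a hypothetical `members` mapping from category name to its pages; it says nothing about how the special page would actually query the database.

```python
def filter_pages(members, required, excluded=()):
    """Pages that are in every required category and in no excluded one."""
    required = list(required)
    if not required:
        return set()
    result = set(members.get(required[0], ()))
    for name in required[1:]:
        result &= set(members.get(name, ()))
    for name in excluded:
        result -= set(members.get(name, ()))
    return result


def remove_category(selected, name):
    """What clicking the little (X) next to a category would do to the list."""
    return [c for c in selected if c != name]
```

Removing a category and re-running `filter_pages` with the shorter list gives the widened result set.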
- If implementing the intersection of more than 2 categories involves nesting, it might be fastest if you start with the smallest categories and continue to process with progressively larger categories. The intersections of the small categories will result in a small result, and each pass the result is at most the same number of members and likely much smaller, so the next intersection might be much faster. If this is the case, then it would be possible to speed up the servers by requiring at least one of the intersected categories to be below some set maximum number of members. Users could get a message that all of the categories selected were too large. The worst case seems to be the intersection of only huge categories with very few or no members in common. If you add a small category to the mix it should be much faster as long as you start with the small one.
This is a nonissue if fulltext is used. Or rather if it is an issue, it's out of our control, and undoubtedly thought of already.
- If the table structure of categories is going to be redesigned, could the same or similar structure be used for links? This way we could also implement link intersections with the same code, and use the wiki-links that already exist on pages as tags. See [[Wikipedia:Link intersection]] (shortcut [[WP:LI]]) for more about this.
Possibly. One thing at a time. There's no point in writing up an implementation of everything at once only to find that there's some unforeseen basic flaw in the way you're doing it, and having to throw out all the shiny extra bits you spent so much time on. We don't even know for sure what efficiency will be like yet, whether fancy fulltext solutions will be good enough or not.
wikitech-l@lists.wikimedia.org