On Fri, Mar 7, 2008 at 9:10 AM, Samuel Wantman <wantman@earthlink.net> wrote:
Very glad to hear there is progress on the category intersection front.
And I welcome any help :-)
A few comments:
- I hope there are some broad discussions about how to integrate this into Wikipedia and other projects. Several of us (mostly admins who focus on categorization) have been discussing this on English Wikipedia at [[Wikipedia:Category intersection]]. You can also get there with the shortcut [[WP:CI]]. If we could designate one place to discuss user interface ideas and designs, that would be helpful.
ATM, I only have a special page with a textbox where one can enter a list of categories, click a button, and get pages in this intersection. Works, but is ugly. No problem adding nicer interfaces, though; the main thing is the algorithm beneath it.
- If implementing the intersection of more than 2 categories involves nesting, it might be fastest to start with the smallest categories and continue with progressively larger ones. Intersecting the small categories first yields a small result, and after each pass the result has at most as many members as before and likely far fewer, so the next intersection might be much faster. If this is the case, then it would be possible to reduce the load on the servers by requiring at least one of the intersected categories to be below some set maximum number of members. Users could get a message that all of the categories selected were too large. The worst case seems to be the intersection of only huge categories with very few or no members in common. If you add a small category to the mix it should be much faster, as long as you start with the small one.
And I would have done exactly that, except there's no quick way to get the size of a category. There was a discussion on this list about storing the size of a category in a special field recently.
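To make the smallest-first idea concrete, here is a minimal sketch in Python. It assumes the category sizes are already known, which is exactly the part that is not cheap today; the in-memory dict of sets and the function name are hypothetical stand-ins, not anything in MediaWiki.

```python
def intersect_smallest_first(categories):
    """Intersect category member sets, starting with the smallest.

    `categories` is a hypothetical in-memory stand-in: a dict mapping
    category name -> set of page titles.
    """
    if not categories:
        return set()
    # Sort by size so the running result starts as small as possible.
    ordered = sorted(categories.values(), key=len)
    result = set(ordered[0])
    for members in ordered[1:]:
        result &= members
        if not result:
            # Nothing left in common; larger categories cannot add members.
            break
    return result
```

The running result can never grow, so each later pass only has to test the few survivors against the next category.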
- If the table structure of categories is going to be redesigned, could the same or similar structure be used for links? This way we could also implement link intersections with the same code, and use the wiki-links that already exist on pages as tags. See [[Wikipedia:Link intersection]] (shortcut [[WP:LI]]) for more about this.
I'm not redesigning the categories table, I'm adding a new one for the pre-calculated intersection hashes. My approach is based on pre-calculating intersections of every combination of categories used in an article. That, in turn, will only work if the average number of categories per article is low.
We have lots of links, which would result in hundreds or thousands of rows per article. That pretty much invalidates my approach for that purpose, for practical reasons (space and lookup time).
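A rough sketch of the pre-calculation idea, to show why the row count matters. This is not the actual extension code: the MD5 hash, the "|" separator, the in-memory dict standing in for the new table, and the choice to store every non-empty subset (including single categories) are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def combo_hash(cats):
    # Sort first so the same set of categories always gives the same hash.
    # "|" is used as separator on the assumption that it cannot occur in titles.
    key = "|".join(sorted(cats))
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def build_index(page_categories):
    """Map hash of each category combination -> set of pages (hypothetical table)."""
    index = defaultdict(set)
    for page, cats in page_categories.items():
        cats = sorted(set(cats))
        # One row per non-empty subset: 2**n - 1 rows for n categories.
        for size in range(1, len(cats) + 1):
            for combo in combinations(cats, size):
                index[combo_hash(combo)].add(page)
    return index


def lookup(index, cats):
    # A single hash lookup answers the whole intersection query.
    return index.get(combo_hash(cats), set())
```

With 5 categories on a page that is 31 rows; with 10 it is 1023, and with the hundreds of links on a typical article the number is hopeless, which is the point made above.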
Reading WP:LI, the "similar pages" could be implemented another way:
1. For page A, choose the links you're interested in (e.g., B, C, D)
2. Get "what links here" for B, C, D
3. List all pages that show up three times
Should not be hard. I'll try to make a toolserver thing for that.
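Roughly, the "similar pages" steps could look like this in Python. The `what_links_here` dict is a hypothetical stand-in for the real backlink queries, and the function name and the `exclude` parameter are made up for the example.

```python
from collections import Counter


def pages_linking_to_all(what_links_here, targets, exclude=None):
    """Return pages that appear in the backlink list of every target.

    `what_links_here` is a hypothetical stand-in: a dict mapping a
    page title -> iterable of titles that link to it.
    """
    targets = set(targets)
    counts = Counter()
    for target in targets:
        # De-duplicate so a page counts at most once per target.
        for page in set(what_links_here.get(target, ())):
            counts[page] += 1
    # Keep pages that showed up once for each chosen link.
    hits = {page for page, n in counts.items() if n == len(targets)}
    hits.discard(exclude)  # e.g. drop page A itself
    return hits
```

For B, C, D this keeps exactly the pages that show up three times; passing `exclude="A"` leaves the starting page out of its own results.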
Cheers, Magnus