On Fri, Mar 7, 2008 at 9:10 AM, Samuel Wantman <wantman(a)earthlink.net> wrote:
Very glad to hear there is progress on the category intersection front.
And I welcome any help :-)
A few comments:
1) I hope there are some broad discussions about how to integrate this
into Wikipedia and other projects. Several of us (mostly admins who
focus on categorization) have been discussing this on English Wikipedia
at [[Wikipedia:Category intersection]]. You can also get there with the
shortcut [[WP:CI]]. If we could designate one place to discuss user
interface ideas and designs, that would be helpful.
ATM, I only have a special page with a textbox where one can enter a
list of categories, click a button, and get pages in this
intersection.
Works, but is ugly. No problem adding nicer interfaces, though; the
main thing is the algorithm beneath it.
2) If implementing the intersection of more than 2 categories involves
nesting, it might be fastest if you start with the smallest categories
and continue to process with progressively larger categories. The
intersections of the small categories will yield a small result, and
after each pass the result has at most as many members and is likely
much smaller, so the next intersection might be much faster. If this is
the case, then it would be possible to speed up the servers by requiring
at least one of the intersected categories to be below some set maximum
number of members. Users could get a message that all of the categories
selected were too large. The worst case seems to be the intersection of
only huge categories with very few or no members in common. If you add
a small category to the mix it should be much faster as long as you
start with the small one.
And I would have done exactly that, except there's no quick way to get
the size of a category.
There was a discussion on this list recently about storing the size of
a category in a special field.
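For illustration, here is a rough sketch of the smallest-first ordering from point 2. It assumes the category sizes are already known, which is exactly what is missing today; the in-memory dict of sets, the function name, and the max_smallest parameter are all hypothetical stand-ins, not anything that exists in MediaWiki.

```python
def intersect_smallest_first(categories, max_smallest=None):
    """Intersect categories, starting with the smallest one.

    categories: dict mapping category name -> set of page titles
                (a hypothetical in-memory stand-in for the real tables).
    max_smallest: optional cap; if even the smallest category is larger
                  than this, refuse the query (assumed policy from point 2).
    """
    if not categories:
        return set()
    # Sort member sets by size so the running result starts small.
    ordered = sorted(categories.values(), key=len)
    if max_smallest is not None and len(ordered[0]) > max_smallest:
        raise ValueError("all selected categories are too large")
    result = set(ordered[0])
    for members in ordered[1:]:
        result.intersection_update(members)
        if not result:
            break  # nothing left in common, later passes cannot add members
    return result
```

The running result can only shrink, so each later pass does at most as much work as the smallest category has members.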
3) If the table structure of categories is going to be redesigned, could
the same or similar structure be used for links? This way we could also
implement link intersections with the same code, and use the wiki-links
that already exist on pages as tags. See [[Wikipedia:Link
intersection]] (shortcut [[WP:LI]]) for more about this.
I'm not redesigning the categories table, I'm adding a new one for the
pre-calculated intersection hashes. My approach is based on
pre-calculating intersections of every combination of categories used
in an article. That, in turn, will only work if the average number of
categories per article is low.
We have lots of links, which would result in hundreds or thousands of
rows per article. That pretty much invalidates my approach for that
purpose, for practical reasons (space and lookup time).
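To make the pre-calculation idea concrete, here is a minimal sketch, not the actual extension code. It reads "every combination" as every subset of two or more categories on a page, which is an assumption; the MD5-of-sorted-names hash, the "|" separator, and the function names are likewise hypothetical choices for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def combo_hash(categories):
    # Sort and de-duplicate so the same set always hashes identically.
    # "|" is used as separator on the assumption that it cannot occur in titles.
    key = "|".join(sorted(set(categories)))
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def build_index(page_categories):
    """page_categories: dict mapping page title -> iterable of category names.

    Returns a dict mapping combination hash -> set of page titles,
    standing in for the extra table of pre-calculated intersections.
    """
    index = defaultdict(set)
    for page, cats in page_categories.items():
        cats = sorted(set(cats))
        for size in range(2, len(cats) + 1):
            for combo in combinations(cats, size):
                index[combo_hash(combo)].add(page)
    return index


def lookup(index, categories):
    # A single hash lookup replaces the intersection at query time.
    return index.get(combo_hash(categories), set())
```

Under this reading, a page with n categories contributes 2^n - n - 1 rows, so the row count stays manageable only while n is small.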
Reading WP:LI, the "similar pages" could be implemented another way:
1. For page A, choose the links you're interested in (e.g., B, C, D)
2. Get "what links here" for B, C, D
3. List all pages that show up three times
Should not be hard. I'll try to make a toolserver thing for that.
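The three steps above could be sketched roughly like this. The backlinks dict is a hypothetical stand-in for "what links here" lookups, and the function name and the exclude parameter are made up for the example.

```python
from collections import Counter


def similar_pages(backlinks, chosen_links, exclude=None):
    """backlinks: dict mapping page title -> iterable of pages linking to it
                  (hypothetical stand-in for "what links here").
    chosen_links: the links picked from page A (e.g. B, C, D).
    exclude: optionally leave out page A itself.
    """
    chosen = set(chosen_links)
    counts = Counter()
    for target in chosen:
        # set() guards against a page being listed twice for one target.
        for source in set(backlinks.get(target, ())):
            counts[source] += 1
    # Keep only pages that showed up once per chosen link.
    return sorted(
        page for page, n in counts.items()
        if n == len(chosen) and page != exclude
    )
```

For example, with B, C, and D chosen, only pages appearing in all three backlink lists are returned.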
Cheers,
Magnus