Robert Stojnic wrote:
Let me briefly repeat what I said earlier about my experience with this category intersection thingy. Adding categories to the Lucene index is easy *IF* they are inside the article, e.g. try this:
http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=%2Binc...
This will give you the category intersection of "Living People" and "English comedy writers" in a fraction of a second.
That's the dirty way. I've gone ahead and written an alternative way of implementing category intersections using a fulltext search, which means you can run the most crazy intersections; in fact, you can search in an article's categories as if they were the page's contents. It's part of the AdvancedSearch extension which I'm paid to write, but it'll be easy to split off just the intersection functionality into another extension. The upside is that I also have a special page front end ready to go. I'll commit AdvancedSearch into SVN once I've worked out the bugs (provided there are any; it's close to midnight now so I don't really feel like testing stuff any more) and worked out stuff with my 'employer', which shouldn't take more than a few days.
On a technical level, the extension adds the categorysearch table (you need to run update.php to actually create the table), which is basically a rip-off from the searchindex table. It has a cs_page field referencing page_id, and keeps itself updated using the LinksUpdate and ArticleDeleteComplete hooks. There's also a maintenance script to populate the table from scratch.
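To make the idea concrete, here is a rough in-memory sketch in Python of how such a table could behave. It is not the extension's code: the class and method names (CategorySearchIndex, on_links_update, on_article_delete, populate, search) are made up for illustration, and the "fulltext search" is reduced to a boolean AND over the words found in a page's category titles.

```python
import re
from collections import defaultdict


class CategorySearchIndex:
    """Toy stand-in for the categorysearch table (all names hypothetical)."""

    def __init__(self):
        # cs_page -> set of lowercase words taken from the page's category titles
        self.rows = {}

    @staticmethod
    def _words(text):
        # Category titles use underscores in the database; treat them as spaces.
        return set(re.findall(r"\w+", text.replace("_", " ").lower()))

    def on_links_update(self, page_id, categories):
        # Plays the role of the LinksUpdate hook: replace the page's row with
        # the categories the parser reported for the new revision.
        words = set()
        for category in categories:
            words |= self._words(category)
        self.rows[page_id] = words

    def on_article_delete(self, page_id):
        # Plays the role of the ArticleDeleteComplete hook: drop the page's row.
        self.rows.pop(page_id, None)

    def populate(self, categorylinks):
        # Plays the role of the maintenance script: rebuild everything from
        # (cl_from, cl_to) pairs as stored in the categorylinks table.
        grouped = defaultdict(set)
        for page_id, category in categorylinks:
            grouped[page_id] |= self._words(category)
        self.rows = dict(grouped)

    def search(self, query):
        # Return the ids of pages whose category text contains every query word.
        wanted = self._words(query)
        return sorted(p for p, words in self.rows.items() if wanted <= words)


index = CategorySearchIndex()
index.populate([
    (1, "Living_people"),
    (1, "English_comedy_writers"),
    (2, "Living_people"),
])
print(index.search("living comedy"))
```

Because the search works on words rather than on whole category titles, a query like "living comedy" matches any page that has both words somewhere in its categories, which is what makes the crazier intersections possible.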
What I found is that the hard part is keeping the index updated. If we want a fancy category intersection system like the one discussed here before, we need an index that is frequently updated, that is integrated with the job queue, that understands templates, etc.
Understanding templates is no problem here, since the updater uses the parser's notion of which categories the page is in, and the populate script uses the categorylinks table.
Lucene is not that good with very frequent updates. The usual setup is to have an indexer, make snapshots of the index at regular intervals, and then rsync them onto the searchers. The whole process takes time, although for a category-only index it will probably be fast. I assume there would be a lag of at least a few tens of minutes anyhow. Our current Lucene framework could easily be used for index distribution and such.
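The lag in the indexer/snapshot/searcher setup described above can be illustrated with a small Python model. This is only a sketch of the general pattern, not of the actual Lucene framework: the Indexer and Searcher classes, the tick and sync methods, and the 30-minute interval are all assumptions made for the example.

```python
class Indexer:
    """Accepts updates immediately but only publishes periodic snapshots."""

    def __init__(self, snapshot_interval):
        self.live = {}       # page_id -> frozenset of categories, always current
        self.snapshot = {}   # last published copy of the live index
        self.snapshot_interval = snapshot_interval  # in minutes (assumed unit)
        self.last_snapshot_at = 0

    def update(self, page_id, categories):
        self.live[page_id] = frozenset(categories)

    def tick(self, now):
        # Publish a new snapshot only when the interval has elapsed.
        if now - self.last_snapshot_at >= self.snapshot_interval:
            self.snapshot = dict(self.live)
            self.last_snapshot_at = now
            return True
        return False


class Searcher:
    """Serves queries from whatever snapshot it last copied."""

    def __init__(self):
        self.index = {}

    def sync(self, indexer):
        # Stands in for the rsync step: copy the latest published snapshot.
        self.index = dict(indexer.snapshot)

    def pages_in(self, category):
        return sorted(p for p, cats in self.index.items() if category in cats)


indexer = Indexer(snapshot_interval=30)
searcher = Searcher()

indexer.update(1, ["Living people"])
indexer.tick(now=10)                 # too early, no snapshot published
searcher.sync(indexer)
stale = searcher.pages_in("Living people")

indexer.tick(now=30)                 # interval elapsed, snapshot published
searcher.sync(indexer)
fresh = searcher.pages_in("Living people")
```

In this model an edit made at minute 0 only becomes searchable after the next snapshot and sync, which is where the lag comes from.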
I really don't have the faintest idea how Lucene works or how MediaWiki interfaces with it, but I do know that Lucene can handle the stuff we put into the searchindex table. Since the categorysearch table is no different, I think Lucene *should* be able to handle it pretty easily as well. Could someone who actually has a clue about all this reply?
Roan Kattouw (Catrope)