Re: [Wikitech-l] So... status of category intersections?

22 May 2008

Robert Stojnic wrote:
...
 Let me briefly repeat what I said earlier about my experience with this
 category intersection thingy. Adding categories to the lucene index is
 easy *IF* they are inside the article, e.g. try this:

http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=%2Bin…

 This will give you the category intersection of "Living People" and
 "English comedy writers" in a fraction of a second.

That's the dirty way. I've gone ahead and written an alternative way of
implementing category intersections using a fulltext search, which means 
you can run the most crazy intersections; in fact, you can search in an 
article's categories as if they were the page's contents. It's part of 
the AdvancedSearch extension which I'm paid to write, but it'll be easy 
to split off just the intersection functionality into another extension. 
The upside is that I also have a special page front end ready to go. 
I'll commit AdvancedSearch into SVN once I've worked out the bugs 
(provided there are any; it's close to midnight now so I don't really 
feel like testing stuff any more) and worked out stuff with my 
'employer', which shouldn't take more than a few days.

On a technical level, the extension adds the categorysearch table (you 
need to run update.php to actually create the table), which is basically 
a rip-off from the searchindex table. It has a cs_page field referencing 
page_id, and keeps itself updated using the LinksUpdate and 
ArticleDeleteComplete hooks. There's also a maintenance script to 
populate the table from scratch.
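
As a rough illustration of the bookkeeping described above (not the
extension's actual code, which is not shown in this mail), the table and
its update paths could be modelled like this in Python. The class and
method names are made up for the sketch, the name normalization is an
assumption, and a plain inverted index stands in for the real fulltext
index:

```python
from collections import defaultdict


class CategorySearchIndex:
    """Hypothetical in-memory stand-in for the categorysearch table."""

    def __init__(self):
        # cs_page (page_id) -> set of normalized category names
        self.rows = {}
        # inverted index: normalized category name -> set of page_ids
        self.by_category = defaultdict(set)

    @staticmethod
    def normalize(category):
        # Assumed normalization: underscores as spaces, case-insensitive.
        return category.replace("_", " ").strip().lower()

    def on_links_update(self, page_id, categories):
        """Analogue of a LinksUpdate handler: replace the page's row."""
        self.on_article_delete(page_id)
        cats = {self.normalize(c) for c in categories}
        self.rows[page_id] = cats
        for cat in cats:
            self.by_category[cat].add(page_id)

    def on_article_delete(self, page_id):
        """Analogue of an ArticleDeleteComplete handler: drop the row."""
        for cat in self.rows.pop(page_id, set()):
            self.by_category[cat].discard(page_id)

    def populate(self, categorylinks):
        """Rebuild from scratch out of (page_id, category) pairs."""
        self.rows.clear()
        self.by_category.clear()
        grouped = defaultdict(list)
        for page_id, cat in categorylinks:
            grouped[page_id].append(cat)
        for page_id, cats in grouped.items():
            self.on_links_update(page_id, cats)

    def intersect(self, *categories):
        """Return sorted page_ids that are in all the given categories."""
        sets = [self.by_category.get(self.normalize(c), set())
                for c in categories]
        if not sets:
            return []
        return sorted(set.intersection(*sets))


idx = CategorySearchIndex()
idx.populate([
    (1, "Living_people"), (1, "English comedy writers"),
    (2, "Living people"),
    (3, "English comedy writers"),
])
print(idx.intersect("Living people", "English comedy writers"))  # [1]
```

The point of the sketch is only that both update paths (hook-driven and
populate-from-scratch) end up writing the same per-page row, so an
intersection is just a lookup over that row data.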
...
 What I found is that the hard part is keeping the index updated. If we
 want a fancy category intersection system like the one discussed here
 before, we need to have an index that is frequently updated, that is
 integrated with the job queue, that understands templates, etc.

Understanding templates is no problem here, since the updater uses the 
parser's notion of which categories the page is in, and the populate 
script uses the categorylinks table.
...
 Lucene is not that good with very frequent updates. The usual setting
 is to have an indexer, make snapshots of the index at regular intervals
 and then rsync it onto searchers. The whole process takes time,
 although for a category-only index it will probably be fast. I assume
 there would be at least a few tens of minutes of lag anyhow. Our
 current lucene framework could easily be used for index distribution
 and such.

I really don't have the faintest idea how Lucene works or how MediaWiki 
interfaces with it, but I do know that Lucene can handle the stuff we 
put into the searchindex table. Since the categorysearch table is no 
different, I think Lucene *should* be able to handle it pretty easily as 
well. Could someone who actually has a clue about all this reply?

Roan Kattouw (Catrope)
