New subject: So... status of category intersections?

23 May 2008

On Thu, May 22, 2008, Roan Kattouw <roan.kattouw(a)home.nl> wrote:

...

 I've gone ahead and written an alternative way of
 implementing category intersections using a fulltext search, which means
 you can run the most crazy intersections; in fact, you can search in an
 article's categories as if they were the page's contents. It's part of
 the AdvancedSearch extension which I'm paid to write, but it'll be easy
 to split off just the intersection functionality into another extension.
 The upside is that I also have a special page front end ready to go.
 I'll commit AdvancedSearch into SVN once I've worked out the bugs
 (provided there are any; it's close to midnight now so I don't really
 feel like testing stuff any more) and worked out stuff with my
 'employer', which shouldn't take more than a few days. 

Wow, awesome! You (and your employer) beat the heck out of all my good
intentions to acquaint myself with the current version of MediaWiki and
write code good enough for production!  I can't wait to see it.

 On a technical level, the extension adds the categorysearch table (you
 need to run update.php to actually create the table), which is basically
 a rip-off from the searchindex table. It has a cs_page field referencing
 page_id, and keeps itself updated using the LinksUpdate and
 ArticleDeleteComplete hooks. There's also a maintenance script to
 populate the table from scratch.

  What I found is that the hard part is keeping the index updated. If we
  want a fancy category intersection system as discussed here before, we
  need to have an index that is frequently updated, that will be
  integrated with the job queue, that will understand templates, etc.

 Understanding templates is no problem here, since the updater uses the
 parser's notion of which categories the page is in, and the populate
 script uses the categorylinks table.

Perfect - yes, exactly the way to go.
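
To make the bookkeeping described above concrete, here is a rough sketch of how such a table could be kept in sync. It is in Python rather than PHP and is not the extension's code: the class and method names are made up for illustration, the "hooks" are plain method calls, and the fulltext matching is reduced to exact category names. Only cs_page, the two hook names and the categorylinks columns (cl_from, cl_to) come from MediaWiki itself.

```python
from collections import defaultdict


class CategorySearchIndex:
    """Toy stand-in for the categorysearch table (hypothetical names)."""

    def __init__(self):
        # cs_page -> set of category names: the "row" for each page.
        self.rows = {}
        # category name -> set of page ids: the lookup index over the rows.
        self.postings = defaultdict(set)

    def on_links_update(self, page_id, categories):
        """What a LinksUpdate-style handler would do: replace the page's row."""
        self.on_article_delete(page_id)  # drop any stale row first
        self.rows[page_id] = set(categories)
        for cat in categories:
            self.postings[cat].add(page_id)

    def on_article_delete(self, page_id):
        """What an ArticleDeleteComplete-style handler would do."""
        for cat in self.rows.pop(page_id, set()):
            self.postings[cat].discard(page_id)

    def populate(self, categorylinks):
        """Rebuild from scratch out of (cl_from, cl_to) pairs."""
        self.rows.clear()
        self.postings.clear()
        grouped = defaultdict(set)
        for page_id, cat in categorylinks:
            grouped[page_id].add(cat)
        for page_id, cats in grouped.items():
            self.on_links_update(page_id, cats)

    def intersect(self, *categories):
        """Return the ids of pages that are in every given category."""
        if not categories:
            return set()
        sets = [self.postings.get(cat, set()) for cat in categories]
        return set.intersection(*sets)
```

For example, after populate([(1, "Physicists"), (1, "Nobel_laureates"), (2, "Physicists")]), intersect("Physicists", "Nobel_laureates") gives back only page 1. Because the update path takes the category list it is handed, categories added via templates come along for free as long as the caller passes the parser's final list.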

...
  Lucene is not that good with very frequent updates. The usual setting
  is to have an indexer, make snapshots of the index at regular
  intervals, and then rsync them onto the searchers. The whole process
  takes time, although for a category-only index it will probably be
  fast. I assume there would be at least a few tens of minutes of lag
  anyhow. Our current Lucene framework could easily be used for index
  distribution and such.

 I really don't have the faintest idea how Lucene works or how MediaWiki
 interfaces with it, but I do know that Lucene can handle the stuff we
 put into the searchindex table. Since the categorysearch table is no
 different, I think Lucene *should* be able to handle it pretty easily as
 well. Could someone who actually has a clue about all this reply?

Lucene doesn't allow edits; it only allows adds and deletes.  Presumably
too many deletes make the index inefficient or something.  But I think all
that is moot: once you've got the categories into their own table, it
*should* be simple to set up another index on the same kind of schedule,
etc. as the base search index, and point it at that table.  Then change the
interface to point to Lucene instead of MySQL.  I'm not familiar with
Wikipedia's Lucene backend, but it seems reasonable to assume that this is
not a major endeavor.
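
To illustrate the add/delete point, here is a toy model in Python of an index that can only append entries and mark them deleted, so an "edit" becomes a delete followed by an add. This is only a sketch of the general idea, not Lucene's API or its internals; every name in it is made up.

```python
class AppendOnlyIndex:
    """Toy index supporting add and delete, but no edit-in-place."""

    def __init__(self):
        self.docs = []        # append-only list of (page_id, categories)
        self.deleted = set()  # positions in self.docs marked as deleted
        self.live = {}        # page_id -> position of its current entry

    def add(self, page_id, categories):
        # Assumes page_id is not already live; use update() otherwise.
        self.live[page_id] = len(self.docs)
        self.docs.append((page_id, frozenset(categories)))

    def delete(self, page_id):
        pos = self.live.pop(page_id, None)
        if pos is not None:
            self.deleted.add(pos)  # the entry stays in place as dead weight

    def update(self, page_id, categories):
        # An "edit" is a delete followed by an add.
        self.delete(page_id)
        self.add(page_id, categories)

    def search(self, *categories):
        """Return ids of live pages that are in all given categories."""
        wanted = set(categories)
        return {
            pid
            for pos, (pid, cats) in enumerate(self.docs)
            if pos not in self.deleted and wanted <= cats
        }

    def wasted_fraction(self):
        """Share of stored entries that are deleted but still scanned."""
        return len(self.deleted) / len(self.docs) if self.docs else 0.0

    def compact(self):
        """Rewrite the index without deleted entries (the periodic rebuild)."""
        kept = [d for pos, d in enumerate(self.docs) if pos not in self.deleted]
        self.docs = kept
        self.deleted = set()
        self.live = {pid: pos for pos, (pid, _) in enumerate(kept)}
```

In this model every update leaves one dead entry behind, which is one plausible reason frequent edits would call for rebuilding on a schedule instead of per edit.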

What does your UI for the intersections look like?  That was the killer for
me; I'm a weak UI guy.  I'd imagine (and implemented a rough prototype of,
years ago) something that lets you "browse" intersections, i.e., given an
intersection a, it would show you the set of all categories B that have
documents that are also in a.  Ideally the most frequently used categories
appear at the top :-)  But I never did any performance testing for this
setup, and additionally, I'm not sure how to do it in Lucene... Anyway,
what's your interface like?
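
For what it's worth, the "browse" idea can be sketched in a few lines of Python. This is a naive in-memory version with made-up names, which assumes the whole page-to-categories mapping fits in a dict, and it says nothing about how to do the same thing in Lucene or how it would perform at Wikipedia's scale.

```python
from collections import Counter


def related_categories(page_categories, selected):
    """List categories that co-occur with the current selection.

    page_categories: dict mapping page id -> set of category names.
    selected: the categories already chosen (the current intersection).
    Returns (category, page_count) pairs, most frequent first.
    """
    selected = set(selected)
    counts = Counter()
    for cats in page_categories.values():
        if selected <= cats:                # page is in the intersection
            counts.update(cats - selected)  # count its other categories
    return counts.most_common()
```

Calling it again with one of the returned categories added to the selection narrows the intersection, which gives the drill-down browsing behaviour.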

Best Regards,
Aerik

-- 
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!

Re: [Wikitech-l] So... status of category intersections?