Re: [Wikitech-l] Category tables and Category Intersection - Wikitech-l

22 Feb 2008

Simetrical wrote:

...

 We don't have to move off MySQL, we just have to use a different
 system for this one feature.  That's perfectly plausible; we use
 Lucene for search.

Ah, something I actually know something about.  This is the third or fourth
time, to my knowledge, that we've discussed category intersection in depth.
Last year (I think it was last year) I did a bunch of pretty extensive
testing, including running MySQL queries against the categories table using
various methods (joins, subselects, you name it), and the consensus was that
it was way too slow (queries against large categories were awful - Living
People was a test case).
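
For anyone who hasn't seen the join variant, here is a minimal sketch of its
shape.  It is not the code from those tests: it uses an in-memory SQLite
database rather than MySQL, and a simplified categorylinks(cl_from, cl_to)
table is assumed purely for illustration.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table of (page id, category) rows.

    The table and column names are illustrative assumptions, and SQLite
    stands in for MySQL so the sketch runs anywhere.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [
            (1, "Living_people"), (1, "American_novelists"),
            (2, "Living_people"),
            (3, "Living_people"), (3, "American_novelists"),
            (4, "American_novelists"),
        ],
    )
    return conn


def pages_in_all(conn, categories):
    """Return ids of pages that belong to every category in `categories`."""
    if not categories:
        raise ValueError("need at least one category")
    joins, params = [], []
    # One extra self-join per additional category.
    for i, cat in enumerate(categories[1:], start=1):
        joins.append(
            f"JOIN categorylinks c{i} "
            f"ON c{i}.cl_from = c0.cl_from AND c{i}.cl_to = ?"
        )
        params.append(cat)
    sql = (
        "SELECT c0.cl_from FROM categorylinks c0 "
        + " ".join(joins)
        + " WHERE c0.cl_to = ? ORDER BY c0.cl_from"
    )
    # The WHERE placeholder comes last in the SQL text, so its value goes last.
    params.append(categories[0])
    return [row[0] for row in conn.execute(sql, params)]
```

Each additional category adds another join against the same table, which is
the part that hurts when one of the categories is as large as Living People.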

So, I also loaded the categories into the cur table (I'm using an old
schema) and created a field holding all of a page's categories, with
underscores replacing the spaces in each category name (as it appears in the
URL).  This made MySQL's fulltext index see the whole category as one word.
This performed *much* faster, and you could use boolean queries to get fancy.
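
A rough sketch of the underscore trick, as an illustration only.  The helper
names and the cur_categories column are made up for this example; MATCH ...
AGAINST (... IN BOOLEAN MODE) is the standard MySQL fulltext syntax, and the
query string is meant to be passed to it as a bound parameter.

```python
import re


def category_token(name):
    """Turn 'Living people' into 'Living_people', as in the URL form."""
    return re.sub(r"\s+", "_", name.strip())


def categories_field(names):
    """Value for the (hypothetical) per-page fulltext column."""
    return " ".join(category_token(n) for n in names)


def boolean_query(required, excluded=()):
    """Build a boolean-mode search string: +must_have -must_not_have."""
    terms = ["+" + category_token(n) for n in required]
    terms += ["-" + category_token(n) for n in excluded]
    return " ".join(terms)


# Hypothetical query; cur_categories is an assumed column name.
INTERSECTION_SQL = (
    "SELECT cur_id FROM cur "
    "WHERE MATCH(cur_categories) AGAINST (%s IN BOOLEAN MODE)"
)
```

For example, boolean_query(["Living people", "American novelists"]) gives
"+Living_people +American_novelists".  Category names containing punctuation
such as hyphens or apostrophes would probably still be split into several
words by the default parser, so they would need extra handling.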

I also created a Lucene index which I queried with zend_search_lucene.  This
actually performed pretty comparably to the MySQL fulltext index.  It's all
in the archives somewhere.  I think either of those solutions would probably
be okay, but if it's wildly popular the load might be a bit much.  I didn't
get (that I recall) any really conclusive opinions from the group or the
core developers.

But, based on all that, here's my suggestion:  create a new Lucene index of
categories using all the existing tools, and do boolean queries against
that.  I think it's the path of least resistance, and the performance should
be quite acceptable (pretty much by definition).
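
To show why boolean queries against such an index should be cheap, here is a
toy inverted index in plain Python.  This is not Lucene and does not use its
API; the CategoryIndex class and its methods are invented for illustration.
It only shows the idea of one posting set per category, with intersection
and exclusion done as set operations.

```python
from collections import defaultdict


class CategoryIndex:
    """Toy inverted index: category name -> set of page ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, page_id, categories):
        """Record that page_id belongs to each of the given categories."""
        for cat in categories:
            self._postings[cat].add(page_id)

    def query(self, must=(), must_not=()):
        """Return sorted ids of pages in every `must` category and in no
        `must_not` category."""
        if not must:
            return []
        # Start from the smallest posting set so intermediate results stay small.
        sets = sorted((self._postings.get(c, set()) for c in must), key=len)
        result = set(sets[0])
        for postings in sets[1:]:
            result &= postings
        for cat in must_not:
            result -= self._postings.get(cat, set())
        return sorted(result)
```

Starting from the smallest set means a query that pairs Living People with a
small category only ever touches as many candidates as the small one holds.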

On a related topic, has anybody on the list messed around with CLucene?  I'm
still playing with it off and on... (I'm a novice at C/C++) seems like a good
choice for a high-performance web-based search (doesn't have the overhead of
being Java)...

Best Regards,
Aerik