Simetrical wrote:
We don't have to move off MySQL, we just have to use a different system for this one feature. That's perfectly plausible; we use Lucene for search.
Ah, something I actually know something about. This is the third or fourth time, to my knowledge, that we've discussed category intersection in depth. Last year (I think it was last year) I did a bunch of pretty extensive testing, including running MySQL queries against the categories table using various methods (joins, subselects, you name it) and the consensus was that was way too slow (queries against large categories were awful - Living People was a test case).
So, I also loaded the categories into the cur table (I'm using an old schema) and created a field holding all the categories with underscores for spaces in the categories (like it appears in the url). This made MySQL's fulltext index see the whole category as one word. This performed *much* faster, and you could use boolean queries to get fancy.
I also created a lucene index which I queried with zend_search_lucene. This actually performed pretty comparably to the MySQL fulltext index. It's all in the archives somewhere. I think either of those solutions would probably be okay, but if it's wildly poplular the load might be a bit much. I didn't get (that I recall) any really conclusive opinions from the group or the core developers.
But, based on all that, here's my suggestion: create a new lucene index of categories using all the existing tools, and do boolean queries against that. I think it's the path of least resistance, and the performance should be quite acceptable (pretty much be definition).
On a related topic, anybody on the list mess around with clucene? I'm still playing with it off and on... (I'm a novice at c/c++) seems like a good choice for a high performance web based search (doesn't have the overhead of being Java)...
Best Regards, Aerik
wikitech-l@lists.wikimedia.org