Christophe Henner wrote:
Hi. Still about category search: I looked for something about it but didn't find anything. What about making it possible to get all the articles matching x categories? For example, giving the list of all the articles that are in both [[Category:Writer]] and [[Category:Born in London]]. Have a nice day -- schiste
That's the much discussed and desired "Category Intersections", and it is a trickier problem (at scale) than I thought. I've been testing some ideas on and off, but have been slowed down by having a hard time clearing the query cache on the server I'm using (I've tried "FLUSH QUERY CACHE" and "RESET QUERY CACHE" but they don't seem to actually do it - all you MySQL gurus out there, what am I missing?).
I've got two ideas I want to test:
1) Use the existing table and the query I've previously suggested, but construct it more smartly by considering the number of pages in each category - in other words, purposefully narrow down the result set as early as possible (for example, look at "People born in 1912" first and then see how many of those are "Living People", instead of the other way around).
2) Try building a table with a fulltext index using a record for each page, and a column for the categories, delimited by spaces (use underscores for spaces in a category name). This may be a bit hackish, but I'm thinking this will get MySQL to do the tricky part of building the index on categories (each being a word in that column) for me. The MySQL people must've made the fulltext index code as efficient as possible, so it will be interesting to see how it performs. I know full text indexing is not acceptable for whole Wikipedia articles, but if we're only considering categories, we're talking about a lot less text. I've been wondering if maybe this is how Flickr handles tags - whatever they're doing, the functionality seems to match what we want to do, and at a large scale, too.
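To make the two ideas a bit more concrete, here is a rough sketch in Python. It is only an illustration under assumptions, not tested code from my experiments: the table layout (a `categorylinks` table with `cl_from` for the page id and `cl_to` for the category name) is modeled on the existing MediaWiki table, SQLite stands in for MySQL so the example is self-contained, and the function names `category_sizes`, `intersect` and `fulltext_terms` are made up for this example. For idea 2, the helper only builds the search string one might pass to a MySQL `MATCH ... AGAINST (... IN BOOLEAN MODE)` query; it does not build or query a real fulltext index.

```python
import sqlite3


def category_sizes(conn, categories):
    """Return {category: number of pages} for the given category names."""
    sizes = {}
    for cat in categories:
        row = conn.execute(
            "SELECT COUNT(*) FROM categorylinks WHERE cl_to = ?", (cat,)
        ).fetchone()
        sizes[cat] = row[0]
    return sizes


def intersect(conn, categories):
    """Idea 1: page ids present in every category, smallest category first."""
    if not categories:
        return []
    sizes = category_sizes(conn, categories)
    # Start from the category with the fewest pages so the candidate
    # set is as small as possible before checking the other categories.
    ordered = sorted(categories, key=sizes.get)
    candidates = {
        row[0]
        for row in conn.execute(
            "SELECT cl_from FROM categorylinks WHERE cl_to = ?", (ordered[0],)
        )
    }
    for cat in ordered[1:]:
        # One lookup per remaining candidate; cheap if (cl_from, cl_to) is indexed.
        candidates = {
            page
            for page in candidates
            if conn.execute(
                "SELECT 1 FROM categorylinks WHERE cl_from = ? AND cl_to = ?",
                (page, cat),
            ).fetchone()
        }
    return sorted(candidates)


def fulltext_terms(categories):
    """Idea 2: boolean-mode search string, one required word per category.

    Spaces become underscores so each category name is a single word.
    """
    return " ".join("+" + cat.replace(" ", "_") for cat in categories)
```

With a small test table, `intersect(conn, ["Living_people", "1912_births"])` would start from whichever of the two categories has fewer pages, and `fulltext_terms(["Writer", "Born in London"])` gives the string `+Writer +Born_in_London`. One thing to watch with idea 2: category names containing punctuation other than underscores would probably be split into several words by the fulltext parser, so they may need extra escaping.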
If neither of these works, then I think we're off into either Lucene or some other search function with a custom index/data structure. But I have the strong impression that those are inherently not updated in real time, which is a bummer.
Best Regards, Aerik
wikitech-l@lists.wikimedia.org