On Tue, 22 Apr 2008, Simetrical wrote:
On Tue, Apr 22, 2008 at 10:59 AM, Roan Kattouw roan.kattouw@home.nl wrote:
I missed the explanation of the fulltext implementation. Something like storing 'Foo With_spaces Bar' and then doing a fulltext search for the categories you need? That would be more powerful, and would probably be faster for complex intersections. I'll write an alternative to CategoryIntersections that uses the fulltext schema and run some benchmarks; I expect to have some results by the end of the week.
Aerik Sylvan has already done an implementation of the backend using CLucene. If a front-end could be done in core, with a pluggable backend, that might have the best chance of getting enabled on Wikimedia relatively quickly. MyISAM fulltext is not necessarily going to be fast enough due to its table-level locking.
Yes, I did a fulltext implementation (which works quite well - I forget the exact response times, but I think it was around a third of a second even for intersections involving large categories like "Living_people"), and the way it handles boolean queries is quite nice. I think I broke it when I moved servers, but I can put it back up. It would probably be a great addition to core and perfectly adequate for small wikis, but too slow for larger ones - a few tenths of a second per query really adds up over tens or hundreds of hits. Updates are also an issue on large wikis, due to table locking on the MyISAM table. For small wikis, though, I think it will be fine.

MySQL's fulltext tokenizer doesn't break words on underscores, so using the category name as it appears in the URL works great for fulltext search, and the built-in fulltext search is *much* faster than doing lookups on the categorylinks table, especially for large sets.
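To make that concrete, here's a minimal sketch of the kind of boolean-mode query I mean; the table and column names (category_fulltext, cf_page, cf_categories) are just placeholders, not what my test setup actually uses:

    -- One row per page; cf_categories holds the page's category names
    -- space-separated, in URL form (underscores intact).
    SELECT cf_page
    FROM category_fulltext
    WHERE MATCH(cf_categories)
      AGAINST('+Living_people +American_musicians' IN BOOLEAN MODE);

The + operator makes both categories required, which is the intersection case; a - prefix excludes a category, and leaving the operator off makes a term optional.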
So, I'd propose we add to core a MyISAM table with a fulltext index of categories - this will suit small wikis. For big wikis, make it an InnoDB table and use it to build a Lucene index, which you'd search with whatever flavor of Lucene you like. This is a fairly straightforward path that covers both small and large wikis, should perform well for either, and is flexible in that it supports boolean searches. I don't have suggestions for an interface, but why not just start with a SpecialPage and see what happens? Once the functionality is there, suggestions for how to better use it will come out of the woodwork.
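To make the proposal a bit more concrete, here's a rough sketch of what such a table could look like - the names are placeholders, and a real patch would follow MediaWiki's schema conventions:

    -- Sketch only: one row per page, categories space-separated in URL form.
    CREATE TABLE category_fulltext (
      cf_page INT UNSIGNED NOT NULL PRIMARY KEY,
      cf_categories TEXT NOT NULL,
      FULLTEXT INDEX cf_categories_ft (cf_categories)
    ) ENGINE=MyISAM;
    -- Note: MySQL's ft_min_word_len defaults to 4, so very short category
    -- names would need that setting lowered to get indexed.

For the big-wiki case, the same table as InnoDB would have no fulltext index (MySQL doesn't support that on InnoDB); it would just be the source the external Lucene index gets built from.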
I'm working on a CLucene daemon (calling it clucened, which is on SF - with slightly out-of-date source in subversion - and at clucened.com), which could be used for this, or anything else. I'm planning to make it Solr compatible, but not a direct port of Solr, and the implementation will have some differences. So far I have only the daemon and the search function (it takes a raw query, which can be boolean or span multiple fields, and passes it through). I think this is really cool, but since we already have a GCJ Lucene search for En, it may be easier just to extend that to read a categories Lucene index than to use another architecture. Either way, I think a search daemon will find an audience and will be a really cool thing :-)
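Just to illustrate the kind of raw query the daemon passes through, this is plain Lucene query syntax; the 'category' field name here is only an assumption about how a categories index might be laid out, not how clucened is actually configured:

    +category:Living_people +category:American_musicians -category:Guitarists
    category:Living_people AND (category:Guitarists OR category:Bassists)

Either form works with the standard Lucene query parser, so boolean intersections and exclusions come for free, and other fields can be mixed into the same query.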
Aerik