On Tue, 22 Apr 2008, Simetrical wrote:
> On Tue, Apr 22, 2008 at 10:59 AM, Roan Kattouw <roan.kattouw(a)home.nl>
> wrote:
> > I missed the explanation of the fulltext
> > implementation. Something like
> > 'Foo With_spaces Bar' and then do a fulltext search for the cats you
> > need? That would be more powerful, and would probably be faster for
> > complex intersections. I'll write an alternative to
> > CategoryIntersections that uses the fulltext schema and run some
> > benchmarks. I expect to have some results by the end of the week.
> Aerik Sylvan has already done an implementation of the backend using
> CLucene. If a front-end could be done in core, with a pluggable
> backend, that might have the best chance of getting enabled on
> Wikimedia relatively quickly. MyISAM fulltext is not necessarily
> going to be fast enough due to the locking.
Yes, I did a fulltext search (which works quite well - I forget the response
times... I think it was around a third of a second even for intersections of
large groups, like "Living_People") and the way it handles booleans and
stuff is quite nice. I think I broke it when I moved servers, but I can put
it back up. I think it would probably be a great addition to core, and
would be perfectly adequate for small wikis, but too slow for larger ones
(a few tenths of a second per query will really add up with tens or
hundreds of hits...). I think doing updates is also an issue on large wikis,
due to table locking of the MyISAM table. But I think it will be fine for
small wikis. MySQL doesn't break words on underscores, so using the category
as it appears in the URL seems to work great for fulltext search, and the
built-in fulltext search is *much* faster than doing lookups on the
categorylinks table, especially for large sets.
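
As a rough illustration of the scheme described above, the sketch below builds the per-page category string (the 'Foo With_spaces Bar' form from the quoted message) and a MySQL boolean-mode query string for an intersection. The table name `category_fulltext`, the columns `page_id` and `cats`, and both helper functions are hypothetical names chosen for this example; none of them come from the actual implementation. It also assumes category names contain only letters, digits and underscores, since other punctuation would likely be treated as a word break by the fulltext parser.

```python
def category_document(categories):
    """Join a page's categories into one space-separated string,
    using the URL form (spaces replaced by underscores)."""
    return " ".join(c.replace(" ", "_") for c in categories)


def boolean_query(required, excluded=()):
    """Build a boolean-mode search string: '+' marks a category the page
    must be in, '-' marks a category the page must not be in."""
    terms = ["+" + c.replace(" ", "_") for c in required]
    terms += ["-" + c.replace(" ", "_") for c in excluded]
    return " ".join(terms)


# Hypothetical table and column names; the query parameter is the string
# produced by boolean_query().
INTERSECTION_SQL = (
    "SELECT page_id FROM category_fulltext "
    "WHERE MATCH(cats) AGAINST (%s IN BOOLEAN MODE)"
)

if __name__ == "__main__":
    print(category_document(["Foo", "With spaces", "Bar"]))
    print(boolean_query(["Living people", "Physicists"], ["Stubs"]))
```

The query string would be passed as a bound parameter to `INTERSECTION_SQL` by whatever database layer is in use; that part is left out here because it depends on the wiki's own DB abstraction.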
So, I'd propose in core we add a MyISAM table with a fulltext index of
categories - this will suit small wikis. For big wikis, make this an InnoDB
table and use it to build a Lucene index, which you'd search with whatever
flavor of Lucene you like. This is a fairly straight path that covers both
core and large wikis, should have good performance for either application,
and is flexible in that it does boolean searches. I don't have suggestions
for an interface, but why not just start with a SpecialPage and see what
happens? Once the functionality is there, suggestions for how to better use
it will come out of the woodwork.
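
To make the "front-end in core, pluggable backend" idea concrete, here is a minimal sketch of what such a split could look like. Everything in it is an assumption made for illustration: the class names, the method names, and the in-memory backend (which stands in for a MyISAM fulltext table or a Lucene index and is not either of them). The point is only that a SpecialPage-style front-end would talk to one small interface, and the storage behind it could be swapped.

```python
from abc import ABC, abstractmethod


class CategorySearchBackend(ABC):
    """Hypothetical interface a front-end would depend on."""

    @abstractmethod
    def update_page(self, page_id, categories):
        """Record the full set of categories for one page."""

    @abstractmethod
    def search(self, required, excluded=()):
        """Return sorted ids of pages in all `required` categories
        and in none of the `excluded` ones."""


class InMemoryBackend(CategorySearchBackend):
    """Toy stand-in for a real backend (fulltext table, Lucene, ...)."""

    def __init__(self):
        self._pages = {}

    def update_page(self, page_id, categories):
        self._pages[page_id] = frozenset(categories)

    def search(self, required, excluded=()):
        required = set(required)
        excluded = set(excluded)
        return sorted(
            page_id
            for page_id, cats in self._pages.items()
            if required <= cats and not (excluded & cats)
        )


if __name__ == "__main__":
    backend = InMemoryBackend()
    backend.update_page(1, ["Living_people", "Physicists"])
    backend.update_page(2, ["Living_people"])
    backend.update_page(3, ["Living_people", "Physicists", "Stubs"])
    print(backend.search(["Living_people", "Physicists"], excluded=["Stubs"]))
```

A small wiki could ship a backend that queries the fulltext table directly, while a large one could plug in a backend that forwards the same request to a search daemon, without the front-end changing.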
I'm working on a CLucene daemon (calling it clucened, which is on SF - with
slightly out-of-date source in Subversion - and at clucened.com), which
could be used for this, or anything else. I'm planning to make it Solr
compatible, but not a direct port of Solr, and the implementation will have
some differences. So far I have only the daemon and the search function
(takes a raw query, which can be boolean or have multiple fields, and passes
it through). I think this is really cool, but if we already have a GCJ
Lucene search for En, it may be easier just to extend that to read a
categories Lucene index than use another architecture. Either way, I think
a search daemon will find an audience and will be a really cool thing :-)
Aerik
--
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!