After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
Here are some highlights:
* spell checking (aka did you mean...)
* ajax prefix suggestions (reimplemented Julien's engine)
* nicer highlighting
* improved scoring
* fuzzy queries, e.g. sarah~ thomson~ will give you all the variations of both words
* suffix wildcards (works on title words only), e.g. *stan will give you all the -stan countries of Central Asia. For performance reasons it won't work nicely on huge sets of words.
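To illustrate what a fuzzy query like sarah~ does, here is a minimal Python sketch. It assumes plain Levenshtein edit distance over a small in-memory vocabulary; the names edit_distance, fuzzy_matches and the max_edits threshold are made up for illustration, and the engine itself may match and rank variations differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def fuzzy_matches(term: str, vocabulary: list[str], max_edits: int = 1) -> list[str]:
    """Return vocabulary words within max_edits of the term (trailing ~ ignored)."""
    term = term.rstrip("~").lower()
    return sorted(w for w in vocabulary if edit_distance(term, w.lower()) <= max_edits)


# Hypothetical vocabulary, only for demonstration.
vocabulary = ["sarah", "sara", "thomson", "thompson", "thomsen", "jones"]
for token in "sarah~ thomson~".split():
    print(token, "->", fuzzy_matches(token, vocabulary))
```

Scanning the whole vocabulary like this is linear in its size, which is fine for a demo but would need an index of some kind on a real word list.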
It also has some other features that might or might not be included in the final release. For instance, "related articles": if you click the Related link next to an article, you will get a list of other articles that frequently occur together with it. This list is used internally to provide context for every article, but I figured it might be interesting for end users as well...
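As a rough picture of how a "related" list could be derived, here is a small Python sketch based on co-occurrence counting. It assumes that "occurring together" means two titles appearing in the same context (for example, being linked from the same page); that definition, the sample data and the function names are my assumptions for illustration, not a description of the engine's internals.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(contexts):
    """Count how often each pair of titles shows up in the same context."""
    counts = defaultdict(Counter)
    for titles in contexts:
        # set() so a title repeated within one context is counted once
        for a, b in combinations(sorted(set(titles)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related(title, counts, limit=5):
    """Titles that co-occur most often with the given one, most frequent first."""
    ranked = sorted(counts.get(title, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:limit]]


# Hypothetical contexts: each inner list is one set of titles seen together.
contexts = [
    ["Kazakhstan", "Uzbekistan", "Central Asia"],
    ["Kazakhstan", "Uzbekistan"],
    ["Kazakhstan", "Central Asia"],
    ["Uzbekistan", "Tashkent"],
]
counts = build_cooccurrence(contexts)
print(related("Kazakhstan", counts))
```

Ties are broken alphabetically here just to keep the output stable.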
I've also documented some of the algorithms I developed at [2]. There you can find out more about how scoring and spell checking work.
Search is a bit slowish, especially on enwiki, since I've crammed all of its revision text, spellcheck indexes, search indexes and other stuff onto a single host. According to my tests, a typical search should be in the 150-180ms range (of CPU time), which is much slower than the current search (25-30ms). Most of the overhead comes from spell checking and highlighting. I was thinking of trying to use some of the 8-cpu boxes...
The ajax suggestions (when properly cached in RAM) are pretty fast (0.2-0.4ms), so we could probably enable them site-wide on search boxes and such. Initially the suggestion data would be updated once a day, but we could shorten that interval, depending on the number of servers and the actual number of requests.
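For a feel of why in-RAM prefix lookups can be this cheap, here is a minimal Python sketch using a sorted list and binary search. The PrefixSuggester class, the numeric rank attached to each title and the sample titles are all hypothetical; the real suggestion engine may store and rank its data quite differently.

```python
import bisect


class PrefixSuggester:
    """Keeps titles sorted in memory and answers prefix queries by binary search."""

    def __init__(self, titles_with_rank):
        # Each entry: (lowercased key, original title, rank). Rank is an assumed
        # popularity score, higher meaning more prominent.
        self._entries = sorted((t.lower(), t, r) for t, r in titles_with_rank)
        self._keys = [e[0] for e in self._entries]

    def suggest(self, prefix, limit=10):
        p = prefix.lower()
        if not p:
            return []
        # All keys starting with p sit in one contiguous slice of the sorted list.
        lo = bisect.bisect_left(self._keys, p)
        hi = bisect.bisect_left(self._keys, p + "\U0010ffff")
        hits = sorted(self._entries[lo:hi], key=lambda e: (-e[2], e[0]))
        return [e[1] for e in hits[:limit]]


# Hypothetical data, only for demonstration.
suggester = PrefixSuggester([
    ("Kazakhstan", 90), ("Kabul", 70), ("Kansas", 80), ("Uzbekistan", 60),
])
print(suggester.suggest("ka"))
```

Finding the slice takes two binary searches, so lookup cost grows with the logarithm of the title count plus the number of hits that need ranking. A daily update would then just mean rebuilding the sorted list from the latest titles.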
Comments & suggestions are welcome!
[1] http://ls2.wikimedia.org/
[2] http://www.mediawiki.org/wiki/User:Rainman/search_internals