After much delay, I've completed a new release candidate for our internal
search engine. The test site where you can see it in action is the same as
before [1], with indexes rebuilt from the latest dumps.
Here are some highlights:
* spell checking (aka did you mean...)
* ajax prefix suggestions (reimplemented Julien's engine)
* nicer highlighting
* improved scoring
* fuzzy queries, e.g. sarah~ thomson~ will give you all the variations
of both of the words
* suffix wildcards (work on title words only), e.g. *stan will give you
all the -stan countries of Central Asia - for performance reasons they
won't work nicely on huge sets of words
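
To make the fuzzy and suffix-wildcard items above a bit more concrete, here
is a minimal Python sketch of the general techniques. It is not the engine's
actual code: the function names, the edit-distance threshold and the
reversed-word index are all assumptions chosen for illustration. Fuzzy
matching is shown as a plain edit-distance scan over a word list, and suffix
matching as a prefix search over reversed title words.

```python
from bisect import bisect_left


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_terms(term, vocabulary, max_edits=1):
    """Expand `term~` to every vocabulary word within max_edits edits.

    max_edits=1 is an assumed threshold, not a documented setting.
    """
    return [w for w in vocabulary if edit_distance(term, w) <= max_edits]


def build_reversed_index(title_words):
    """Sorted list of reversed words, so a suffix becomes a prefix."""
    return sorted(w[::-1] for w in set(title_words))


def suffix_terms(suffix, reversed_index):
    """Expand `*suffix` by scanning the run of entries sharing the prefix."""
    key = suffix[::-1]
    i = bisect_left(reversed_index, key)
    matches = []
    while i < len(reversed_index) and reversed_index[i].startswith(key):
        matches.append(reversed_index[i][::-1])
        i += 1
    return sorted(matches)
```

The scan in suffix_terms grows with the number of matching words, which is
one plausible reason a wildcard over a huge set of words gets slow.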
It also has some other features that might or might not be included
in the final release. For instance, "related articles" - if you click the
Related link next to an article you will get a list of other articles
that occur frequently together with it. This list is internally used
to provide context for every article, but I figured it might be
interesting for end users as well...
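
A rough Python sketch of how such a "related" list could be derived from
co-occurrence counts follows. What exactly counts as "occurring together" is
not spelled out above, so this assumes the input is, for each page, the set
of article titles it links to; the function names and the ranking by raw
count are likewise hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(link_sets):
    """Count how often each pair of articles appears in the same link set."""
    counts = defaultdict(Counter)
    for links in link_sets:
        for a, b in combinations(sorted(set(links)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related(article, counts, limit=5):
    """Top co-occurring articles, most frequent first, ties by title."""
    ranked = sorted(counts.get(article, Counter()).items(),
                    key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:limit]]
```

A real ranking would probably normalize the counts so that very frequently
linked articles do not dominate every list.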
I've also documented some of the algorithms I developed at [2]. There
you can find out more about how scoring and spell checking work.
Search is a bit slowish, especially on enwiki, since I've crammed all of
its revision text, spellcheck indexes, search indexes and other stuff on
a single host. According to my tests, a typical search should be in the
150-180ms range (of CPU time), which is roughly six times slower than the
current one (25-30ms).
Most overhead comes from spell checking and highlighting. I was
thinking of trying to use some of the 8-cpu boxes...
The ajax suggestions (when properly cached in RAM) are pretty fast
(0.2-0.4ms), so we could probably enable them site-wide on search boxes
and such. Initially they would be updated once a day, but we could cut
that down, depending on number of servers and actual number of requests.
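
For a sense of why an in-RAM prefix lookup can be this cheap, here is a
minimal Python sketch. It is not the actual suggestion engine: the class
name, the per-title score and the sorted-array layout are assumptions. All
titles sit in one sorted list, a prefix maps to a contiguous slice found by
binary search, and the slice is ranked by score.

```python
from bisect import bisect_left, bisect_right


class PrefixSuggester:
    """In-memory prefix lookup over a sorted list of lowercased titles."""

    def __init__(self, scored_titles):
        # scored_titles: {title: score}; the score is a hypothetical
        # ranking signal (e.g. popularity), not something from the engine.
        self._entries = sorted(
            (title.lower(), title, score)
            for title, score in scored_titles.items()
        )
        self._keys = [entry[0] for entry in self._entries]

    def suggest(self, prefix, limit=10):
        p = prefix.lower()
        lo = bisect_left(self._keys, p)
        # Appending the highest code point gives an upper bound that sorts
        # after any ordinary key starting with p.
        hi = bisect_right(self._keys, p + "\U0010ffff")
        hits = sorted(self._entries[lo:hi], key=lambda e: (-e[2], e[1]))
        return [entry[1] for entry in hits[:limit]]
```

Rebuilding the list from a fresh dump is one sort, which fits a once-a-day
update; updating more often would mean rebuilding or merging more often.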
Comments & suggestions are welcome!
[1] http://ls2.wikimedia.org/
[2] http://www.mediawiki.org/wiki/User:Rainman/search_internals