After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
Here are some highlights:
- spell checking (aka did you mean...)
- ajax prefix suggestions (reimplemented Julien's engine)
- nicer highlighting
- improved scoring
- fuzzy queries, e.g. sarah~ thomson~ will give you all variations of both words
- suffix wildcards (title words only), e.g. *stan will give you all the -stan countries of Central Asia; for performance reasons it won't work nicely on huge sets of words (see the query sketch after this list)
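For anyone curious how the fuzzy (~) and suffix-wildcard forms map onto plain Lucene, here is a minimal sketch of the underlying query types; it is not the actual lucene-search code, and the field names "contents" and "title" are just placeholders for the example.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;

public class QuerySketch {
    public static void main(String[] args) {
        // "sarah~ thomson~": each term also matches close spelling variations,
        // scored by edit-distance similarity.
        BooleanQuery fuzzy = new BooleanQuery();
        fuzzy.add(new FuzzyQuery(new Term("contents", "sarah")), BooleanClause.Occur.SHOULD);
        fuzzy.add(new FuzzyQuery(new Term("contents", "thomson")), BooleanClause.Occur.SHOULD);

        // "*stan": a leading wildcard has to enumerate every matching term,
        // which is why it is restricted to title words.
        WildcardQuery suffix = new WildcardQuery(new Term("title", "*stan"));

        System.out.println(fuzzy);
        System.out.println(suffix);
    }
}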
It also has some other features that may or may not be included in the final release. For instance, "related articles": if you click the Related link next to an article, you will get a list of other articles that frequently occur together with it. This list is used internally to provide context for every article, but I figured it might be interesting for end users as well...
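How "occur frequently together" is actually computed is described at [2]; purely to illustrate the general idea, here is a hypothetical co-occurrence counter (a sketch, not the real implementation) that ranks related articles by how often they show up in the same context, e.g. linked from the same page.

import java.util.*;

public class RelatedSketch {
    /** Rank other articles by how often they appear in the same context
     *  (here modelled as a set of article names, e.g. the links on one page). */
    static List<String> related(String article, List<Set<String>> contexts) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Set<String> context : contexts) {
            if (!context.contains(article)) continue;
            for (String other : context) {
                if (other.equals(article)) continue;
                Integer c = counts.get(other);
                counts.put(other, c == null ? 1 : c + 1);
            }
        }
        List<String> result = new ArrayList<String>(counts.keySet());
        Collections.sort(result, new Comparator<String>() {
            public int compare(String a, String b) {
                return counts.get(b) - counts.get(a); // most frequent first
            }
        });
        return result;
    }
}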
I've also documented some of the algorithms I developed at [2]; there you can find out more about how scoring and spell checking work.
Search is a bit slow, especially on enwiki, since I've crammed all of its revision text, spellcheck indexes, search indexes and other data onto a single host. According to my tests, a typical search should be in the 150-180 ms range (of CPU time), which is much slower than the current setup (25-30 ms). Most of the overhead comes from spell checking and highlighting. I was thinking of trying to use some of the 8-CPU boxes...
The ajax suggestions (when properly cached in RAM) are pretty fast (0.2-0.4 ms), so we could probably enable them site-wide on search boxes and such. Initially they would be updated once a day, but we could cut that down, depending on the number of servers and the actual number of requests.
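To give a feel for why an in-RAM prefix lookup stays that fast, here is a generic sketch of the technique (not the actual suggestion engine): a binary search over a sorted array of titles held in memory. A production version would also rank the candidates, e.g. by article popularity, rather than return them in alphabetical order.

import java.util.Arrays;

public class PrefixSketch {
    private final String[] titles; // sorted, lower-cased titles kept in RAM

    PrefixSketch(String[] sortedTitles) {
        this.titles = sortedTitles;
    }

    /** Return up to max titles starting with the given prefix. */
    String[] suggest(String prefix, int max) {
        int i = Arrays.binarySearch(titles, prefix);
        if (i < 0) i = -i - 1; // first possible match
        int end = i;
        while (end < titles.length && end - i < max && titles[end].startsWith(prefix)) {
            end++;
        }
        // O(log n) to find the start, then a short scan; no disk access involved.
        return Arrays.copyOfRange(titles, i, end);
    }
}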
Comments & suggestions are welcome!
[1] http://ls2.wikimedia.org/
[2] http://www.mediawiki.org/wiki/User:Rainman/search_internals
Robert Stojnic rainmansr@gmail.com wrote:
After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
Here are some highlights:
- spell checking (aka did you mean...)
[...]
Ah! Gone be the last reason to use Google!
Excellent, Tim
On 4/8/08, Robert Stojnic rainmansr@gmail.com wrote:
After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
How very exciting. :-) Thanks for all your work on this, Robert.
BTW: I recently discovered an interesting search engine that used an autocompletion mechanism I hadn't seen before, where you can essentially autocomplete using multiple words, instead of a single string. For example, you could type "no ma la" and it would find "No Man's Land". There are advantages and disadvantages to that approach, I'm sure, but I found it fun to play with.
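That behaviour can be approximated by treating each query token as a prefix that must match some word of the title; the following is only a rough sketch of that idea, not the engine in question.

import java.util.Arrays;
import java.util.List;

public class MultiPrefixSketch {
    /** True if every query token is a prefix of some word in the title,
     *  so "no ma la" matches "No Man's Land". */
    static boolean matches(String query, String title) {
        String[] tokens = query.toLowerCase().split("\\s+");
        List<String> words = Arrays.asList(title.toLowerCase().split("[\\s']+"));
        for (String token : tokens) {
            boolean found = false;
            for (String word : words) {
                if (word.startsWith(token)) { found = true; break; }
            }
            if (!found) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("no ma la", "No Man's Land")); // prints true
    }
}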
Robert Stojnic wrote:
After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
Here are some highlights:
- spell checking (aka did you mean...)
- ajax prefix suggestions (reimplemented Julien's engine)
- nicer highlighting
- improved scoring
- fuzzy queries, e.g. sarah~ thomson~ will give you all variations of both words
- suffix wildcards (title words only), e.g. *stan will give you all the -stan countries of Central Asia; for performance reasons it won't work nicely on huge sets of words
Sweeeet! :)
Search is a bit slow, especially on enwiki, since I've crammed all of its revision text, spellcheck indexes, search indexes and other data onto a single host. According to my tests, a typical search should be in the 150-180 ms range (of CPU time), which is much slower than the current setup (25-30 ms). Most of the overhead comes from spell checking and highlighting. I was thinking of trying to use some of the 8-CPU boxes...
Yeah, we might need to dedicate more hardware to handle that.
The ajax suggestions (when properly cached in RAM) are pretty fast (0.2-0.4 ms), so we could probably enable them site-wide on search boxes and such. Initially they would be updated once a day, but we could cut that down, depending on the number of servers and the actual number of requests.
Cooooool!
-- brion
Hoi, How does this "did you mean" work in languages like Dutch, German, French but also Russian, Georgian, Armenian, Swahili ??? Thanks, GerardM
On Wed, Apr 9, 2008 at 1:36 AM, Robert Stojnic rainmansr@gmail.com wrote:
After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
[...]
You can test German and French on the test site; as for the others, it works identically for every language. It's not as good for non-Latin scripts, since metaphones cannot be used, but it should work for any language that uses an alphabet. I've tested Cyrillic for sr, and it works as expected. Spell checking will be disabled for Chinese, Japanese and Korean, since I couldn't find an open-source package to determine edit distance between logograms.
As for how it normally works: it spell-checks individual words (based on edit distance, metaphones and frequency), then checks how the words fit into phrases, tries to find words that match in context, and matches whole titles.
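As a rough illustration of the per-word step only (edit distance plus word frequency; metaphones, phrases and context are left out), here is a sketch in which the distance cut-off and the tie-breaking are invented for the example:

import java.util.Map;

public class SpellSketch {
    /** Plain Levenshtein distance; character-based, so it also works for
     *  Cyrillic and other alphabetic scripts. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Among dictionary words within distance 2, prefer the most frequent. */
    static String suggest(String word, Map<String, Integer> wordFrequencies) {
        String best = null;
        int bestFreq = -1;
        for (Map.Entry<String, Integer> e : wordFrequencies.entrySet()) {
            int dist = editDistance(word, e.getKey());
            if (dist > 0 && dist <= 2 && e.getValue() > bestFreq) {
                best = e.getKey();
                bestFreq = e.getValue();
            }
        }
        return best; // null if nothing close enough was found
    }
}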
r.
On Wed, Apr 9, 2008 at 7:19 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, How does this "did you mean" work in languages like Dutch, German, French but also Russian, Georgian, Armenian, Swahili ??? Thanks, GerardM
On Wed, Apr 9, 2008 at 1:36 AM, Robert Stojnic rainmansr@gmail.com wrote:
[...]
2008/4/9 Robert Stojnic rainmansr@gmail.com:
You can test German and French on the test site; as for the others, it works identically for every language. It's not as good for non-Latin scripts, since metaphones cannot be used, but it should work for any language that uses an alphabet. I've tested Cyrillic for sr, and it works as expected. Spell checking will be disabled for Chinese, Japanese and Korean, since I couldn't find an open-source package to determine edit distance between logograms.
These days Korean uses logograms only very rarely. Another likely problem is languages which don't put spaces between words: Korean does use spaces, but Thai, like Chinese and Japanese, does not. Khmer (Cambodian) also doesn't use spaces, but it's only just beginning to appear on the internet, whereas Thai has seen plenty of use for years.
Andrew Dunbar.
As for how it normally works: it spell-checks individual words (based on edit distance, metaphones and frequency), then checks how the words fit into phrases, tries to find words that match in context, and matches whole titles.
r.
[...]