After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
Here are some highlights:
- spell checking (aka did you mean...)
- ajax prefix suggestions (reimplemented Julien's engine)
- nicer highlighting
- improved scoring
- fuzzy queries, e.g. sarah~ thomson~ will give you all variations of both words
- suffix wildcards (title words only), e.g. *stan will give you all the -stan countries of Central Asia; for performance reasons it won't work nicely on huge sets of words (see the query sketch after this list)
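For anyone curious how the fuzzy (~) and suffix-wildcard forms map onto plain Lucene, here is a minimal sketch of the underlying query types; it is not the actual lucene-search code, and the field names "contents" and "title" are just placeholders for the example.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;

public class QuerySketch {
    public static void main(String[] args) {
        // "sarah~ thomson~": each term also matches close spelling variations,
        // scored by edit-distance similarity.
        BooleanQuery fuzzy = new BooleanQuery();
        fuzzy.add(new FuzzyQuery(new Term("contents", "sarah")), BooleanClause.Occur.SHOULD);
        fuzzy.add(new FuzzyQuery(new Term("contents", "thomson")), BooleanClause.Occur.SHOULD);

        // "*stan": a leading wildcard has to enumerate every matching term,
        // which is why it is restricted to title words.
        WildcardQuery suffix = new WildcardQuery(new Term("title", "*stan"));

        System.out.println(fuzzy);
        System.out.println(suffix);
    }
}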
It also has some other features that may or may not be included in the final release. For instance, "related articles": if you click the Related link next to an article, you will get a list of other articles that frequently occur together with it. This list is used internally to provide context for every article, but I figured it might be interesting for end users as well...
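How "occur frequently together" is actually computed is described at [2]; purely to illustrate the general idea, here is a hypothetical co-occurrence counter (a sketch, not the real implementation) that ranks related articles by how often they show up in the same context, e.g. linked from the same page.

import java.util.*;

public class RelatedSketch {
    /** Rank other articles by how often they appear in the same context
     *  (here modelled as a set of article names, e.g. the links on one page). */
    static List<String> related(String article, List<Set<String>> contexts) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Set<String> context : contexts) {
            if (!context.contains(article)) continue;
            for (String other : context) {
                if (other.equals(article)) continue;
                Integer c = counts.get(other);
                counts.put(other, c == null ? 1 : c + 1);
            }
        }
        List<String> result = new ArrayList<String>(counts.keySet());
        Collections.sort(result, new Comparator<String>() {
            public int compare(String a, String b) {
                return counts.get(b) - counts.get(a); // most frequent first
            }
        });
        return result;
    }
}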
I've also documented some of the algorithms I developed at [2]; there you can find out more about how scoring and spell checking work.
Search is a bit slow, especially on enwiki, since I've crammed all of its revision text, spellcheck indexes, search indexes and other data onto a single host. According to my tests, a typical search should be in the 150-180 ms range (of CPU time), which is much slower than the current setup (25-30 ms). Most of the overhead comes from spell checking and highlighting. I was thinking of trying to use some of the 8-CPU boxes...
The ajax suggestions (when properly cached in RAM) are pretty fast (0.2-0.4 ms), so we could probably enable them site-wide on search boxes and such. Initially they would be updated once a day, but we could cut that down, depending on the number of servers and the actual number of requests.
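To give a feel for why an in-RAM prefix lookup stays that fast, here is a generic sketch of the technique (not the actual suggestion engine): a binary search over a sorted array of titles held in memory. A production version would also rank the candidates, e.g. by article popularity, rather than return them in alphabetical order.

import java.util.Arrays;

public class PrefixSketch {
    private final String[] titles; // sorted, lower-cased titles kept in RAM

    PrefixSketch(String[] sortedTitles) {
        this.titles = sortedTitles;
    }

    /** Return up to max titles starting with the given prefix. */
    String[] suggest(String prefix, int max) {
        int i = Arrays.binarySearch(titles, prefix);
        if (i < 0) i = -i - 1; // first possible match
        int end = i;
        while (end < titles.length && end - i < max && titles[end].startsWith(prefix)) {
            end++;
        }
        // O(log n) to find the start, then a short scan; no disk access involved.
        return Arrays.copyOfRange(titles, i, end);
    }
}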
Comments & suggestions are welcome!
[1] http://ls2.wikimedia.org/
[2] http://www.mediawiki.org/wiki/User:Rainman/search_internals
Robert Stojnic rainmansr@gmail.com wrote:
After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
Here are some highlights:
- spell checking (aka did you mean...)
[...]
Ah! Gone be the last reason to use Google!
Excellent, Tim
On 4/8/08, Robert Stojnic rainmansr@gmail.com wrote:
After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
How very exciting. :-) Thanks for all your work on this, Robert.
BTW: I recently discovered an interesting search engine that used an autocompletion mechanism I hadn't seen before, where you can essentially autocomplete using multiple words, instead of a single string. For example, you could type "no ma la" and it would find "No Man's Land". There are advantages and disadvantages to that approach, I'm sure, but I found it fun to play with.
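That behaviour can be approximated by treating each query token as a prefix that must match some word of the title; the following is only a rough sketch of that idea, not the engine in question.

import java.util.Arrays;
import java.util.List;

public class MultiPrefixSketch {
    /** True if every query token is a prefix of some word in the title,
     *  so "no ma la" matches "No Man's Land". */
    static boolean matches(String query, String title) {
        String[] tokens = query.toLowerCase().split("\\s+");
        List<String> words = Arrays.asList(title.toLowerCase().split("[\\s']+"));
        for (String token : tokens) {
            boolean found = false;
            for (String word : words) {
                if (word.startsWith(token)) { found = true; break; }
            }
            if (!found) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("no ma la", "No Man's Land")); // prints true
    }
}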
Robert Stojnic wrote:
After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
Here are some highlights:
- spell checking (aka did you mean...)
- ajax prefix suggestions (reimplemented Julien's engine)
- nicer highlighting
- improved scoring
- fuzzy queries, e.g. sarah~ thomson~ will give you all variations of both words
- suffix wildcards (title words only), e.g. *stan will give you all the -stan countries of Central Asia; for performance reasons it won't work nicely on huge sets of words
Sweeeet! :)
Search is a bit slow, especially on enwiki, since I've crammed all of its revision text, spellcheck indexes, search indexes and other data onto a single host. According to my tests, a typical search should be in the 150-180 ms range (of CPU time), which is much slower than the current setup (25-30 ms). Most of the overhead comes from spell checking and highlighting. I was thinking of trying to use some of the 8-CPU boxes...
Yeah, we might need to dedicate more hardware to handle that.
The ajax suggestions (when properly cached in RAM) are pretty fast (0.2-0.4 ms), so we could probably enable them site-wide on search boxes and such. Initially they would be updated once a day, but we could cut that down, depending on the number of servers and the actual number of requests.
Cooooool!
-- brion
Hoi, How does this "did you mean" work in languages like Dutch, German, French but also Russian, Georgian, Armenian, Swahili ??? Thanks, GerardM
On Wed, Apr 9, 2008 at 1:36 AM, Robert Stojnic rainmansr@gmail.com wrote:
After much delay, I've completed a new release candidate for our internal search engine. The testing site where you can see it in action is the same as before [1], with indexes rebuilt from the latest dumps.
[...]
You can test German and French on the test site; as for the others, it works identically for every language. It's not as good for non-Latin scripts, since metaphones cannot be used, but it should work for any language that uses an alphabet. I've tested Cyrillic for sr, and it works as expected. Spell checking will be disabled for Chinese, Japanese and Korean, since I couldn't find an open-source package to determine edit distance between logograms.
As for how it normally works: it spell-checks individual words (based on edit distance, metaphones and frequency), then checks how the words fit into phrases, tries to find words that match in context, and matches whole titles.
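As a rough illustration of the per-word step only (edit distance plus word frequency; metaphones, phrases and context are left out), here is a sketch in which the distance cut-off and the tie-breaking are invented for the example:

import java.util.Map;

public class SpellSketch {
    /** Plain Levenshtein distance; character-based, so it also works for
     *  Cyrillic and other alphabetic scripts. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Among dictionary words within distance 2, prefer the most frequent. */
    static String suggest(String word, Map<String, Integer> wordFrequencies) {
        String best = null;
        int bestFreq = -1;
        for (Map.Entry<String, Integer> e : wordFrequencies.entrySet()) {
            int dist = editDistance(word, e.getKey());
            if (dist > 0 && dist <= 2 && e.getValue() > bestFreq) {
                best = e.getKey();
                bestFreq = e.getValue();
            }
        }
        return best; // null if nothing close enough was found
    }
}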
r.
On Wed, Apr 9, 2008 at 7:19 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, How does this "did you mean" work in languages like Dutch, German, French but also Russian, Georgian, Armenian, Swahili ??? Thanks, GerardM
On Wed, Apr 9, 2008 at 1:36 AM, Robert Stojnic rainmansr@gmail.com wrote:
[...]
2008/4/9 Robert Stojnic rainmansr@gmail.com:
You can test German and French on the test site; as for the others, it works identically for every language. It's not as good for non-Latin scripts, since metaphones cannot be used, but it should work for any language that uses an alphabet. I've tested Cyrillic for sr, and it works as expected. Spell checking will be disabled for Chinese, Japanese and Korean, since I couldn't find an open-source package to determine edit distance between logograms.
These days Korean uses logograms only very rarely. Another likely problem is languages which don't put spaces between words: Korean does use spaces, but Thai, like Chinese and Japanese, does not. Khmer (Cambodian) also doesn't use spaces, but it's only just beginning to appear on the internet, whereas Thai has seen plenty of use for years.
Andrew Dunbar.
As for how it normally works: it spell-checks individual words (based on edit distance, metaphones and frequency), then checks how the words fit into phrases, tries to find words that match in context, and matches whole titles.
r.
[...]