Re: [Wikitech-l] new search backend test

10 Apr 2008

2008/4/9 Robert Stojnic rainmansr@gmail.com:
...
You can test German and French on the test site; as for the others, it
works identically for every language. It's not as good for non-Latin
scripts, since metaphones cannot be used, but it should work for any
language that uses an alphabet. I've tested Cyrillic for sr, and it works
as expected. Spell checking will be disabled for Chinese, Japanese and
Korean, since I couldn't find an open source package to determine edit
distance between logograms.
These days Korean uses logograms only very rarely. Another likely problem
is languages which don't put spaces between words. Korean does use spaces,
but Thai, like Chinese and Japanese, does not. Khmer (Cambodian) also
doesn't use spaces, but it's only just beginning to appear on the internet,
whereas Thai has seen plenty of use for years.
Andrew Dunbar.
...
As for how it normally works: it spell-checks individual words (based on
edit distance, metaphones, frequency), then checks how the words fit into
phrases. It also tries to find words that match in context, and matches
whole titles.
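
To make the word-level step concrete, here is a minimal sketch of
ranking candidates by edit distance and then by frequency. It is not the
engine's code: the function names and the toy frequency table are
assumptions for illustration, and the metaphone, phrase and context
steps are left out entirely.

```python
from typing import Dict, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word: str, freq: Dict[str, int], max_dist: int = 2) -> Optional[str]:
    """Return the closest known word, preferring more frequent ones on ties.

    `freq` is a hypothetical word -> corpus frequency table.
    Returns None if the word is already known or nothing is close enough.
    """
    if word in freq:
        return None
    best = None
    for cand, count in freq.items():
        d = edit_distance(word, cand)
        if d <= max_dist:
            key = (d, -count, cand)  # smaller distance first, then higher count
            if best is None or key < best:
                best = key
    return best[2] if best else None


if __name__ == "__main__":
    table = {"thomson": 50, "thomas": 200, "sarah": 10}
    print(suggest("thomsen", table))
```

A linear scan over the whole vocabulary like this would be far too slow
on a real index; it only shows the ranking idea.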
r.
On Wed, Apr 9, 2008 at 7:19 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
...
Hoi,
How does this "did you mean" work in languages like Dutch, German, French,
but also Russian, Georgian, Armenian, Swahili?
Thanks,
    GerardM
On Wed, Apr 9, 2008 at 1:36 AM, Robert Stojnic rainmansr@gmail.com wrote:
...
After much delay, I've completed a new release candidate for our internal
search engine. The testing site where you can see it in action is the same
as before [1], with indexes rebuilt from the latest dumps.
Here are some highlights:

- spell checking (aka did you mean...)
- ajax prefix suggestions (reimplemented Julien's engine)
- nicer highlighting
- improved scoring
- fuzzy queries, e.g. sarah~ thomson~ will give you all the variations
  of both of the words
- suffix wildcards (works on title words only), e.g. *stan will give you
  all the -stan countries of Central Asia - for performance reasons it
  won't work nicely on huge sets of words
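
For readers wondering how a suffix wildcard such as *stan can be
answered at all, one common approach is to index each title word
reversed, so that a suffix query becomes a prefix scan over a sorted
list. The sketch below assumes that approach; the thread does not say
whether the engine does it this way, and the function names are made up
for the example.

```python
import bisect
from typing import Iterable, List


def build_reversed_index(words: Iterable[str]) -> List[str]:
    """Store every distinct word reversed, in sorted order."""
    return sorted(w[::-1] for w in set(words))


def suffix_matches(index: List[str], suffix: str) -> List[str]:
    """Return all indexed words ending in `suffix`, alphabetically."""
    key = suffix[::-1]
    i = bisect.bisect_left(index, key)  # first reversed word >= key
    out = []
    # All reversed words sharing the prefix `key` sit next to each other.
    while i < len(index) and index[i].startswith(key):
        out.append(index[i][::-1])
        i += 1
    return sorted(out)


if __name__ == "__main__":
    idx = build_reversed_index(
        ["kazakhstan", "uzbekistan", "tajikistan", "stanford", "france"]
    )
    print(suffix_matches(idx, "stan"))
```

The scan visits every matching word, which fits the remark that very
large match sets are where this gets expensive.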
It also has some other features that might or might not be included
in the final release. For instance, "related articles": if you click the
Related link next to an article, you will get a list of other articles
that frequently occur together with it. This list is used internally
to provide context for every article, but I figured it might be
interesting for end users as well...
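
As a rough illustration of "articles that occur frequently together",
the sketch below counts how often two titles appear in the same context
and lists the most frequent partners. What counts as a context (for
example, the set of links on one page) is an assumption here, as are
the function names; the post does not describe the actual method.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List


def cooccurrence(contexts: Iterable[Iterable[str]]) -> Dict[str, Counter]:
    """Count, for every title, how often each other title shares a context."""
    related: Dict[str, Counter] = {}
    for titles in contexts:
        # Deduplicate so a title repeated in one context is counted once.
        for a, b in combinations(sorted(set(titles)), 2):
            related.setdefault(a, Counter())[b] += 1
            related.setdefault(b, Counter())[a] += 1
    return related


def top_related(related: Dict[str, Counter], title: str, n: int = 5) -> List[str]:
    """Most frequent co-occurring titles; ties are broken alphabetically."""
    counts = related.get(title, Counter())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


if __name__ == "__main__":
    rel = cooccurrence([
        ["Paris", "France", "Seine"],
        ["Paris", "France"],
        ["Paris", "Louvre"],
    ])
    print(top_related(rel, "Paris", 2))
```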
I've also documented some of the algorithms I developed at [2]. There
you can find out more about how scoring and spell checking work.
Search is a bit slowish, especially on enwiki, since I've crammed all of
its revision text, spellcheck indexes, search indexes and other stuff on
a single host. According to my tests, a typical search should be in the
150-180 ms range (of CPU time), which is much slower than the current
engine (25-30 ms).
Most overhead comes from spell checking and highlighting. I was
thinking of trying to use some of the 8-CPU boxes...
The ajax suggestions (when properly cached in RAM) are pretty fast
(0.2-0.4 ms), so we could probably enable them site-wide on search boxes
and such. Initially the data would be updated once a day, but we could cut
that down, depending on the number of servers and actual number of requests.
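
To show why prefix suggestions held in RAM can be that cheap, here is a
small sketch that keeps lowercased titles in a sorted list and answers
a prefix query with one binary search plus a short scan. The class
name, the score field and the ranking by score are assumptions for the
example, not a description of the reimplemented engine.

```python
import bisect
from typing import Dict, List, Tuple


class PrefixSuggester:
    """In-memory prefix lookup over titles, ranked by a hypothetical score."""

    def __init__(self, title_scores: Dict[str, int]) -> None:
        # Map lowercased key -> (original title, score) for case-insensitive lookup.
        self._entries: Dict[str, Tuple[str, int]] = {
            title.lower(): (title, score) for title, score in title_scores.items()
        }
        self._keys: List[str] = sorted(self._entries)

    def suggest(self, prefix: str, limit: int = 10) -> List[str]:
        p = prefix.lower()
        i = bisect.bisect_left(self._keys, p)  # first key >= prefix
        hits: List[Tuple[str, int]] = []
        while i < len(self._keys) and self._keys[i].startswith(p):
            hits.append(self._entries[self._keys[i]])
            i += 1
        # Highest score first; alphabetical order breaks ties.
        hits.sort(key=lambda ts: (-ts[1], ts[0]))
        return [title for title, _ in hits[:limit]]


if __name__ == "__main__":
    s = PrefixSuggester(
        {"Sarah Thomson": 5, "Sarajevo": 50, "Saturn": 30, "Paris": 99}
    )
    print(s.suggest("sar"))
```

Rebuilding the sorted list from a fresh dump once a day would match the
update schedule mentioned above.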
Comments & suggestions are welcome!
[1] http://ls2.wikimedia.org/
[2] http://www.mediawiki.org/wiki/User:Rainman/search_internals

Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net
