To keep the message size down, I'm going to trim heavily.
=Results from other wikis=
I understand that the method used by Kolkus and Rehurek is
dictionary-based (word unigrams)? It should outperform cybozu (char
n-gram based) on small texts. I think that's true if the text is like
tweets, with short phrases, but it may not work properly for names. This
certainly deserves some tests on real data.
Yeah, names are a pain for lots of reasons. n-grams may help categorize
them ethnolinguistically, similarly to language identification, but that
doesn't tell you where to search. For example, Célia Šašić is a German
footballer with a French first name and Croatian last name (by
marriage)—and she's not in the Croatian wiki, though she is in English,
German, French, and others. Did I mention that names are a pain?
A couple more interesting ideas from a couple of papers (though there's
always a danger of falling down the literature rabbit hole):
Looking at tweets:
http://ceur-ws.org/Vol-1228/tweetlid-1-gamallo.pdf
- good results on tweets with Naive Bayes classifier built on words,
and decent results with a simple ranked list of the top N words
- in both cases they added simple suffix scoring to get what I think of
as the best bit of n-grams
Looking at "query-style" texts:
http://www.uni-weimar.de/medien/webis/publications/papers/lipka_2010.pdf
- claim good results with a Naive Bayes classifier built on
n-grams—though they use 4-grams and 5-grams
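For concreteness, here's a minimal sketch of the kind of character n-gram Naive Bayes classifier those papers describe. The two-sentence training sets, trigram size, and add-one smoothing are my own toy choices for illustration, not the papers' actual setup:

```python
from collections import Counter, defaultdict
import math

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, padding word boundaries."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NGramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams, add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # lang -> ngram -> count
        self.totals = Counter()             # lang -> total ngrams seen

    def train(self, lang, texts):
        for text in texts:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)

    def classify(self, text):
        vocab = {g for c in self.counts.values() for g in c}
        scores = {}
        for lang, c in self.counts.items():
            score = 0.0
            for g in char_ngrams(text, self.n):
                # Smoothed log-probability of this n-gram given the language.
                score += math.log((c[g] + 1) / (self.totals[lang] + len(vocab)))
            scores[lang] = score
        return max(scores, key=scores.get)

clf = NGramNaiveBayes(n=3)
clf.train("en", ["the quick brown fox", "search results were good"])
clf.train("fr", ["le renard brun rapide", "les résultats de recherche"])
print(clf.classify("the results"))  # -> en
```

In a real test we'd train on much larger corpora per language; the point is just that the whole mechanism is a few dozen lines, which bears on the "is it easy to implement" question below.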
But, yeah, everything comes down to whether it's fast and easy to
implement, and how it performs on real data.
This raises another question as we add more fall-back methods to decrease
the zero-result rate: how will we prioritize them? I mean, if I can
re-run a "Did you mean" query, and I know that running the original query
against another wiki has a good chance of giving results, which one
should I try first?
Yep, that was my point (G):
(G) Another interesting question: if we end up implementing several
options for improving search results, we will have to figure out how to
stage them and in what order to try/test for them.
I think it's worth running [the cross-wiki] test regularly and seeing how
results change.
I agree.
(B) Make multilingual results configurable—If we know, say, the top four
wikis likely to give good results for queries from the English wiki are
Spanish, French, German, and Japanese, we could have an expanding section
(excuse any UI ugliness—someone with UI smarts can help us figure out how
to make it pretty, right?) to enable multilingual searching, so on
English Wikipedia I could ask for "backup results" in Spanish and French,
but not German and Japanese. Store those settings in a cookie for later,
too, possibly with some UI indicator that multilingual backup results are
enabled. (Also, if the cookie is available at query time, we could skip
cross-wiki searches the user couldn't possibly use.)
Maybe there are sensible defaults per language?
I think we can look for defaults per language in terms of where it makes
sense to look based on the fact that we're likely to find something. No
point looking in language X—even if the user can read it—if we never find
anything in X.
But which languages make the most sense to search really depends on the
user, doesn't it? At least for ranking. I'd much rather have a mediocre result in a
language I can read than a perfect result in a language I can't read. We
could limit it by where we think we'll find something based on our tests,
but the user should be able to further limit results based on whether they
can use them.
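The way these two constraints could combine might look something like this sketch—the wiki codes, default order, and cookie-derived set are all hypothetical: the site supplies a fallback order based on where our tests found results, and the user's saved preferences prune it further.

```python
# Hypothetical per-wiki defaults: fallback wikis ordered by how often
# our cross-wiki tests found results for that home wiki's zero-result
# queries. Real data would come from the tests discussed above.
DEFAULT_FALLBACKS = {"en": ["es", "fr", "de", "ja"]}

def fallback_wikis(home_wiki, user_languages=None):
    """Order candidate wikis by expected usefulness, then drop any the
    user has said they can't read (e.g. from a preferences cookie)."""
    candidates = DEFAULT_FALLBACKS.get(home_wiki, [])
    if user_languages is None:
        return candidates  # no preference known: fall back to site defaults
    return [w for w in candidates if w in user_languages]

print(fallback_wikis("en", {"es", "fr"}))  # -> ['es', 'fr']
```

The filtering never *adds* a wiki the defaults excluded—no point searching where we never find anything—but the user can always narrow the list to languages they can actually use.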
(D) Another sneakier idea that came to mind—which may not be technically
plausible—would be to find good results in another language and then
check for links back to articles in the wiki the search came from.
I don't know if it's technically plausible, but AFAIK we have the
Wikibase ID in the index, so it should be pretty simple to extract it.
Interwiki links are stored in Wikidata; could we use WDQS for that
purpose? With the entity ID it should be easy to request the interwiki
link for a specific language. Is WDQS designed for this usage (a high
number of queries per second on rather simple queries)?
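For the record, the lookup itself is simple SPARQL: sitelinks are exposed via schema:about/schema:isPartOf, and WDQS predefines the wd: and schema: prefixes. A sketch of the query builder (the entity ID and language here are just examples):

```python
def sitelink_query(entity_id, lang):
    """Build a SPARQL query asking for the article on <lang>.wikipedia
    that is linked to the given Wikidata entity (e.g. "Q42")."""
    return f"""
SELECT ?article WHERE {{
  ?article schema:about wd:{entity_id} ;
           schema:isPartOf <https://{lang}.wikipedia.org/> .
}}
""".strip()

print(sitelink_query("Q42", "fr"))
```

That string would go to the https://query.wikidata.org/sparql endpoint; whether the service is sized for doing this at query-time rates is exactly the open question.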
I also thought of WDQS for this. We should ask Stas.
=Misspellings=
Reducing the prefix length to 1 character can hurt performance, and it's
certainly a good idea to do this in 2 passes as Erik suggested.
Yeah, I worry about performance with prefix=1—but we can test it at small
scale and see what it costs and how much it helps.
While working on prefixes I tried to analyze data from a simple wiki dump
and extracted the distribution of term frequency by prefix length. I
haven't managed to make good use of the data yet, but I'm sure you will :)
I described a way to analyze the content we have in the index here:
https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump
It's still on a very small dataset, but if you find it useful maybe we
could try it on a larger one?
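As a small illustration of the kind of analysis described above (not the actual script from the wikitech page), here's a sketch that bins term frequencies by prefix over a toy index, to see how quickly longer prefixes narrow the candidate set:

```python
from collections import Counter

def prefix_stats(term_freqs, max_len=4):
    """For each prefix length 1..max_len, sum the term frequency behind
    each distinct prefix; fewer terms per prefix = cheaper prefix query."""
    stats = {}
    for n in range(1, max_len + 1):
        buckets = Counter()
        for term, freq in term_freqs.items():
            if len(term) >= n:
                buckets[term[:n]] += freq
        stats[n] = buckets
    return stats

# Toy index: term -> frequency
terms = {"search": 50, "seat": 10, "sea": 30, "tree": 5}
stats = prefix_stats(terms)
print(stats[1])  # frequency mass behind each 1-char prefix
```

On a real dump, the interesting number is how skewed stats[1] is versus stats[2]: if a handful of 1-char prefixes cover most of the mass, that's where prefix=1 would hurt.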
I will take a look!
(Two caveats: I don't really have superpowers, so maybe there's not much
there. I'll add it to my stack, which is getting bigger every day.)
My point here is (in the long term): maybe it's difficult to build good
suggestions from the data directly, so why not build a custom
dictionary/index to handle "Did you mean" suggestions? According to
https://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s they learn from
search queries to build these suggestions. Is this something worth trying?
Definitely worth trying. Erik's got a patch in for adding a session id (
https://gerrit.wikimedia.org/r/#/c/226466/ ). In addition to identifying
prefix searches that come right before the user finishes typing their full
text query, this would be good for looking for zero-results (or even
low-results) queries followed by a similarly-spelled successful query from
the same search session. "saerch" followed by "search" gives us hints
that the latter is a good suggestion for the former—especially if it
happens a lot.
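A sketch of what mining those sessions might look like, assuming we had ordered (query, result count) pairs per session. The similarity measure (stdlib difflib) and the 0.8 threshold are placeholders, not a worked-out design:

```python
import difflib

def suggestion_pairs(session_queries, threshold=0.8):
    """From an ordered list of (query, result_count) in one search
    session, pair each zero-result query with the next similarly spelled
    query that did return results: evidence that the latter is a good
    "did you mean" suggestion for the former."""
    pairs = []
    for i, (query, hits) in enumerate(session_queries):
        if hits > 0:
            continue  # only zero-result queries need a suggestion
        for later, later_hits in session_queries[i + 1:]:
            similar = difflib.SequenceMatcher(None, query, later).ratio()
            if later_hits > 0 and similar >= threshold:
                pairs.append((query, later))
                break
    return pairs

session = [("saerch", 0), ("search", 120), ("banana", 7)]
print(suggestion_pairs(session))  # -> [('saerch', 'search')]
```

Aggregating these pairs across many sessions—and weighting by how often each pair recurs—is where the "especially if it happens a lot" part comes in.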
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation