To keep the message size down, I'm going to trim heavily.
=Results from other wikis=
I understand that the method used by Kolkus and Rehurek is dictionary-based (word unigrams)? It's supposed to outperform Cybozu (char n-gram based) on small texts. I think that's true if the text is tweet-like, with short phrases, but it may not work properly for names? This certainly deserves some testing on real data.
Yeah, names are a pain for lots of reasons. n-grams may help categorize them ethnolinguistically, similarly to language identification, but that doesn't tell you where to search. For example, Célia Šašić is a German footballer with a French first name and Croatian last name (by marriage)—and she's not in the Croatian wiki, though she is in English, German, French, and others. Did I mention that names are a pain?
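Anyway, to make the contrast in the question concrete, here's a toy sketch of the two approaches; the lexicons and n-gram profiles would come from real training data, and nothing here is the actual Kolkus & Rehurek or Cybozu code:

    def score_word_unigrams(text, lang_lexicons):
        # dictionary method: fraction of tokens found in each language's lexicon
        tokens = text.lower().split()
        return {lang: sum(t in lex for t in tokens) / max(len(tokens), 1)
                for lang, lex in lang_lexicons.items()}

    def char_ngrams(text, n=3):
        padded = " " + text.lower() + " "
        return {padded[i:i + n] for i in range(len(padded) - n + 1)}

    def score_char_ngrams(text, lang_profiles, n=3):
        # n-gram method: overlap between the text's n-grams and each profile
        grams = char_ngrams(text, n)
        return {lang: len(grams & profile) / max(len(grams), 1)
                for lang, profile in lang_profiles.items()}

The sketch also shows why names are hard for both: a one- or two-token name gives the dictionary method at most a couple of lookups, and the n-grams of a name reflect its ethnolinguistic origin rather than which wiki the person is actually in.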
A couple more interesting ideas from a couple of papers (though there's always a danger of falling down the literature rabbit hole):
Looking at tweets: http://ceur-ws.org/Vol-1228/tweetlid-1-gamallo.pdf - good results on tweets with a Naive Bayes classifier built on words, and decent results with a simple ranked list of the top N words - in both cases they added simple suffix scoring to get what I think of as the best bit of n-grams
Looking at "query-style" texts: http://www.uni-weimar.de/medien/webis/publications/papers/lipka_2010.pdf - claim good results with a Naive Bayes classifier built on n-grams—though they use 4-grams and 5-grams
But, yeah, everything comes down to whether it's fast, whether it's easy to implement, and how it performs on real data.
This raises another question as we add more fall-back methods to decrease the zero-results rate: how will we prioritize the fall-back methods? I mean, if I can re-run a "Did you mean" query, and I know that running the original query against another wiki has a good chance of giving results, which one should I try first?
Yep, that was my point (G):
(G) Another interesting question: if we end up implementing several options for improving search results, we will have to figure out how to stage them and in what order to try/test for them.
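To make the staging question concrete, the naive version is just an ordered chain where the ordering itself is the thing we'd have to test; the method names below are placeholders, not actual CirrusSearch hooks:

    def search_with_fallbacks(query, methods):
        # try each fall-back method in priority order; stop at the first hit
        for name, method in methods:
            results = method(query)
            if results:
                return name, results
        return None, []

    # e.g., one ordering to A/B test against its reverse:
    # methods = [("did_you_mean", rerun_with_suggestion),
    #            ("cross_wiki", run_on_other_wikis)]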
I think it's worth running [the cross-wiki] test regularly and seeing how results change.
I agree.
(B) Make multilingual results configurable. If we know, say, the top four wikis likely to give good results for queries from the English wiki are Spanish, French, German, and Japanese, we could have an expanding section (excuse the UI ugliness; someone with UI smarts can help us figure out how to make it pretty, right?) to enable multilingual searching, so on English Wikipedia I could ask for "backup results" in Spanish and French, but not German and Japanese. Store those settings in a cookie for later, too, possibly with some UI indicator that multilingual backup results are enabled. (Also, if the cookie is available at query time, we could skip unnecessary cross-wiki searches the user couldn't possibly use.)
Maybe there are sensible defaults per language?
I think we can set defaults per language based on where we're likely to find something. No point looking in language X, even if the user can read it, if we never find anything in X.
But which languages make the most sense to search really depends on the user, doesn't it? At least for ranking. I'd much rather have a mediocre result in a language I can read than a perfect result in a language I can't. We could limit the set to where we think we'll find something, based on our tests, but the user should be able to further limit results based on whether they can use them.
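Those two ideas combine pretty naturally, something like the sketch below; the defaults table is invented for illustration and we'd fill it in from the cross-wiki test results:

    # per-wiki defaults: where we actually find results (placeholder values)
    DEFAULTS = {"en": ["es", "fr", "de", "ja"]}

    def backup_languages(source_wiki, user_langs=None):
        defaults = DEFAULTS.get(source_wiki, [])
        if user_langs is None:        # no cookie set: defaults alone
            return defaults
        # only search wikis the user said they can read
        return [lang for lang in defaults if lang in user_langs]

    # e.g., backup_languages("en", user_langs={"es", "fr"}) -> ["es", "fr"]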
(D) Another sneakier idea that came to mind—which may not be technically plausible—would be to find good results in another language and then check for links back to wiki articles in the wiki the search came from.
I don't know if it's technically plausible, but AFAIK we have the wikibase ID in the index, so it should be pretty simple to extract it. Interwiki links are stored in Wikidata; could we use WDQS for that purpose? With the entity ID it should be easy to request the interwiki link for a specific language. Is WDQS designed for this usage (a high number of queries per second on rather simple queries)?
I also thought of WDQS for this. We should ask Stas.
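For what it's worth, if WDQS does turn out to be suitable, the lookup itself is a trivial SPARQL query. A rough sketch (the entity ID and User-Agent below are just examples; whether WDQS can take the query rate is exactly the open question):

    import requests

    WDQS = "https://query.wikidata.org/sparql"

    def sitelink(entity_id, wiki="https://en.wikipedia.org/"):
        # sitelinks are modeled as schema:about / schema:isPartOf in Wikidata
        query = """
        SELECT ?article WHERE {
          ?article schema:about wd:%s ;
                   schema:isPartOf <%s> .
        }""" % (entity_id, wiki)
        r = requests.get(WDQS,
                         params={"query": query, "format": "json"},
                         headers={"User-Agent": "cross-wiki-fallback-test/0.1"})
        r.raise_for_status()
        bindings = r.json()["results"]["bindings"]
        return bindings[0]["article"]["value"] if bindings else None

    # e.g., sitelink("Q42", wiki="https://fr.wikipedia.org/") returns the
    # French article URL for that item, or None if there's no French sitelink.

If query rate is a worry, the plain Wikidata API (action=wbgetentities with props=sitelinks and a sitefilter) does the same single-item lookup and might be the safer bet; another thing to ask Stas.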
=Misspellings=
Reducing the prefix length to 1 character can hurt performance, and it's certainly a good idea to do this in two passes, as Erik suggested.
Yeah, I worry about performance with prefix=1, but we can test it at small scale and see what it costs and how much it helps.
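A sketch of what the two-pass version might look like, assuming an Elasticsearch-style phrase suggester with a prefix_length knob (field names and the body layout are illustrative, not the actual CirrusSearch config): try the cheap prefix_length=2 pass first, and only pay for prefix_length=1 when the first pass comes back empty.

    def suggest_body(text, prefix_length):
        return {"suggest": {"dym": {
            "text": text,
            "phrase": {
                "field": "suggest",
                "direct_generator": [{"field": "suggest",
                                      "prefix_length": prefix_length}],
            },
        }}}

    def two_pass_suggest(es, index, text):
        for prefix_length in (2, 1):   # cheap pass first, expensive pass second
            resp = es.search(index=index, body=suggest_body(text, prefix_length))
            options = resp["suggest"]["dym"][0]["options"]
            if options:
                return options[0]["text"]
        return None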
While working on prefixes I tried to analyze data from a Simple English Wikipedia dump and extracted the distribution of term frequency by prefix length. I haven't managed to make good use of the data yet, but I'm sure you will :)
I described a way to analyze the content we have in the index here:
https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump It's still on a very small dataset, but if you find it useful, maybe we could try it on a larger one?
I will take a look! (Two caveats: I don't really have superpowers, so maybe there's not much there. I'll add it to my stack, which is getting bigger every day.)
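For anyone who wants to poke at the same question without the full dump tooling, the core of the analysis is something like this (assuming term_freqs is a term -> frequency mapping already extracted from the dump):

    from collections import Counter

    def prefix_stats(term_freqs, max_len=5):
        # for each prefix length: how many distinct terms, and how much total
        # frequency, each prefix covers
        stats = {}
        for n in range(1, max_len + 1):
            distinct = Counter()   # prefix -> number of distinct terms
            mass = Counter()       # prefix -> summed term frequency
            for term, freq in term_freqs.items():
                if len(term) >= n:
                    distinct[term[:n]] += 1
                    mass[term[:n]] += freq
            stats[n] = (distinct, mass)
        return stats

    # e.g., comparing len(stats[1][0]) to len(stats[2][0]) shows how much one
    # extra prefix character narrows the space of competing suggestion candidates,
    # which is the prefix=1 vs. prefix=2 cost question in miniature.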
My point here is (in the long term): maybe it's difficult to build good suggestions from the data directly, so why not build a custom dictionary/index to handle "Did you mean" suggestions? According to https://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s they learn from search queries to build these suggestions. Is this something worth trying?
Definitely worth trying. Erik's got a patch in for adding a session id ( https://gerrit.wikimedia.org/r/#/c/226466/ ). In addition to identifying prefix searches that come right before the user finishes typing their full text query, this would be good for looking for zero-results (or even low-results) queries followed by a similarly-spelled successful query from the same search session. "saerch" followed by "search" gives us hints that the latter is a good suggestion for the former—especially if it happens a lot.
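Once the session ID is in the logs, the mining step could be as simple as something like this; the log format here is assumed, and difflib similarity is a cheap stand-in for a proper edit distance:

    from collections import Counter, defaultdict
    from difflib import SequenceMatcher

    def candidate_suggestions(log, min_sim=0.8):
        # log: (session_id, query, num_results) tuples in time order
        by_session = defaultdict(list)
        for session_id, query, num_results in log:
            by_session[session_id].append((query, num_results))
        pairs = Counter()   # (bad_query, good_query) -> how often it happened
        for events in by_session.values():
            for (q1, n1), (q2, n2) in zip(events, events[1:]):
                if n1 == 0 and n2 > 0 and \
                        SequenceMatcher(None, q1, q2).ratio() >= min_sim:
                    pairs[(q1, q2)] += 1   # frequent pairs = strong suggestions
        return pairs

    # e.g., ("saerch", "search") showing up often means "search" is a good
    # suggestion for "saerch".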
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation