I understand that the method used by Kolkus and Rehurek is dictionary-based (word unigrams), and that it should outperform cybozu (char n-gram based) on small texts. I think that's true if the text is like tweets with short phrases, but it may not work properly for names. This certainly deserves some tests on real data.
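To make the contrast concrete, here is a minimal sketch of the two approaches. The dictionaries are toy stand-ins, not what Kolkus and Rehurek or cybozu actually ship; real detectors use models trained on large corpora:
<syntaxhighlight lang="python">
from collections import Counter

# Toy word sets; a real dictionary-based detector (Kolkus & Rehurek
# style) loads per-language frequency dictionaries from large corpora.
WORD_DICTS = {
    "en": {"the", "of", "and", "in", "is"},
    "fr": {"le", "de", "et", "la", "est"},
}

def detect_word_unigram(text):
    """Score each language by the share of words found in its dictionary."""
    words = text.lower().split()
    if not words:
        return None
    scores = {lang: sum(w in vocab for w in words) / len(words)
              for lang, vocab in WORD_DICTS.items()}
    best = max(scores, key=scores.get)
    # This is the weak spot for names: "boris becker" matches no
    # dictionary words, so every score is 0 and we cannot decide.
    return best if scores[best] > 0 else None

def char_ngram_profile(text, n=3):
    """Char n-gram counts, the raw material of cybozu-style detection.
    Names still produce usable n-grams even when no dictionary word matches."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
</syntaxhighlight>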
=Results from other wikis=
This raises another question: as we add more fall-back methods to decrease the zero-result rate, how will we prioritize them? I mean, if I can re-run a "Did you mean" query, and if I know that running the original query against another wiki has a good chance of giving results, which one should I try first?
(G) Another interesting question: if we end up implementing several options for improving search results, we will have to figure out how to stage them and in what order to try/test them.
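One simple way to stage them is an ordered chain that stops at the first method returning any hits; a sketch is below, with all method names made up for illustration:
<syntaxhighlight lang="python">
def search_with_fallbacks(query, methods):
    """Run fall-back methods in priority order and stop at the first
    one that returns hits. `methods` is a list of (name, callable)."""
    for name, method in methods:
        results = method(query)
        if results:
            return name, results
    return None, []

# Hypothetical staging; the order is exactly the open question above:
# methods = [
#     ("original", run_original_query),
#     ("did_you_mean", run_did_you_mean_query),
#     ("cross_wiki", run_cross_wiki_query),
# ]
</syntaxhighlight>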
I think it's worth running [the cross-wiki] test regularly to see how the results change.
(B) Make multilingual results configurable: if we know, say, the top four wikis likely to give good results for queries from the English wiki are Spanish, French, German, and Japanese, we could have an expanding section (excuse the UI ugliness; someone with UI smarts can help us figure out how to make it pretty, right?) to enable multilingual searching, so on English Wikipedia I could ask for “backup results” in Spanish and French, but not German and Japanese. Store those settings in a cookie for later, too, possibly with some UI indicator that multilingual backup results are enabled. (Also, if the cookie is available at query time, we could save unnecessary cross-wiki searches the user couldn't possibly use.)
Maybe there are sensible defaults per language?
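Sketching the cookie idea; the cookie name, its format, and the per-wiki defaults are all made up for illustration:
<syntaxhighlight lang="python">
# Hypothetical per-wiki defaults, used when the user never opted in.
DEFAULTS_BY_WIKI = {"en": ["es", "fr"]}

def backup_languages(cookies, home_wiki):
    """Read the opted-in backup languages from a (made-up) cookie,
    e.g. "es|fr"; fall back to sensible per-wiki defaults."""
    raw = cookies.get("multilingual-backup")
    if raw is None:
        return DEFAULTS_BY_WIKI.get(home_wiki, [])
    return [lang for lang in raw.split("|") if lang]

def cross_wiki_targets(cookies, home_wiki, candidate_wikis):
    """Only search the wikis the user opted into, saving the
    unnecessary cross-wiki requests mentioned above."""
    enabled = set(backup_languages(cookies, home_wiki))
    return [w for w in candidate_wikis if w in enabled]
</syntaxhighlight>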
(D) Another sneakier idea that came to mind, which may not be technically plausible, would be to find good results in another language and then check for links back to wiki articles in the wiki the search came from.
I don't know if it's technically plausible, but AFAIK we have the wikibase id in the index, so it should be pretty simple to extract it. Interwiki links are stored in Wikidata; could we use WDQS for that purpose? With the entity ID it should be easy to request the interwiki link for a specific language. Is WDQS designed for this usage (a high number of queries/sec on rather simple queries)?
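For reference, a minimal sketch of that lookup against the public WDQS endpoint (the entity ID and target wiki are just examples); whether the endpoint would tolerate this at search-traffic rates is exactly the open question:
<syntaxhighlight lang="python">
import requests

WDQS = "https://query.wikidata.org/sparql"

def interwiki_link(entity_id, site="https://fr.wikipedia.org/"):
    """Fetch the sitelink of a Wikidata entity on one specific wiki,
    using the standard schema:about / schema:isPartOf pattern."""
    query = f"""
    SELECT ?article WHERE {{
      ?article schema:about wd:{entity_id} ;
               schema:isPartOf <{site}> .
    }}"""
    resp = requests.get(WDQS,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "interwiki-sketch/0.1"})
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return rows[0]["article"]["value"] if rows else None

# interwiki_link("Q42") -> the French Wikipedia article for that entity
</syntaxhighlight>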
Reducing the prefix length to 1 char can hurt performance, so it's certainly a good idea to do this in 2 passes as Erik suggested.
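As I understand the two-pass suggestion, it could look like this: only pay for the expensive 1-char prefix when the cheaper pass comes back empty. The `search` interface here is hypothetical:
<syntaxhighlight lang="python">
def two_pass_prefix_search(query, search, normal_prefix_len=2):
    """First pass with the normal prefix length; only fall back to the
    costly 1-char prefix when the first pass returns nothing.
    `search(query, prefix_length)` is a made-up interface."""
    results = search(query, prefix_length=normal_prefix_len)
    if results:
        return results
    return search(query, prefix_length=1)
</syntaxhighlight>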
=Misspellings=
While working on prefixes I tried to analyze data from a simplewiki dump and extracted the distribution of term frequency by prefix length. I haven't managed to make any good use of the data yet, but I'm sure you will :)
I described a way to analyze the content we have in the index here: https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump
It's still a very small dataset, but if you find it useful maybe we could try it on a larger one?
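One plausible reading of that aggregation, sketched under the assumption that the dump yields (term, frequency) pairs as described on the page above:
<syntaxhighlight lang="python">
from collections import defaultdict

def freq_by_prefix_length(term_freqs, max_len=8):
    """For each prefix length L, count how many distinct prefixes exist
    and how much total term frequency they cover. `term_freqs` is
    assumed to be an iterable of (term, frequency) pairs extracted
    from the Cirrus index dump."""
    prefixes = defaultdict(set)
    totals = defaultdict(int)
    for term, freq in term_freqs:
        for length in range(1, min(len(term), max_len) + 1):
            prefixes[length].add(term[:length])
            totals[length] += freq
    return {length: (len(prefixes[length]), totals[length])
            for length in sorted(prefixes)}
</syntaxhighlight>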
My point here is (in the long term): maybe it's difficult to build good suggestions from the index data directly, so why not build a custom dictionary/index to handle "Did you mean" suggestions? According to https://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s they learn from search queries to build these suggestions. Is this something worth trying?
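A very rough sketch of what "learning from search queries" could mean here; the session/log format is entirely assumed:
<syntaxhighlight lang="python">
from collections import Counter

def mine_suggestion_pairs(sessions):
    """Whenever a zero-result query is immediately followed in the same
    session by a rewritten query that did return results, count that
    rewrite as a candidate suggestion. `sessions` is assumed to be a
    list of [(query, hit_count), ...] per user session."""
    pairs = Counter()
    for session in sessions:
        for (q1, hits1), (q2, hits2) in zip(session, session[1:]):
            if hits1 == 0 and hits2 > 0 and q1 != q2:
                pairs[(q1, q2)] += 1
    return pairs

# The most frequent (misspelling, correction) pairs would then be
# loaded into the dedicated "Did you mean" dictionary/index.
</syntaxhighlight>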