Hi Everyone,
I've done further analysis on the ~1400 zero-results non-DOI query corpus, looking at the effects of perfect (or at least human-level) language detection, and the effects of running all queries against many wikis.
In summary:
More than 85% of failed queries to enwiki are in English, or are not in any particular language. Only about 35% of the non-English queries that are in some identifiable language (<4.5% of zero-results queries) get any results when funneled to the right language wiki.
The types of queries most likely to get results from the non-enwikis are names and queries in English. There are lots of English words in non-English wikis (enough that they can do decent spelling correction!), and the idiosyncrasies of language processing on other wikis allow certain classes of typos in names and English words to match, or the typos happen to exist uncorrected in the non-enwiki.
Perhaps a better approach to handling non-English queries is user-specified alternate languages.
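
As a rough sanity check on the percentages in the first summary point, the sketch below turns them into approximate query counts. It uses only the rounded figures quoted above (~1400 queries, about 35%, <4.5%), so the results are upper bounds and estimates, not the exact counts from the analysis. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the summary figures.
# All inputs are the rounded values quoted in this email, not exact counts.
CORPUS_SIZE = 1400         # "~1400" zero-results non-DOI queries
RESCUED_SHARE_MAX = 0.045  # "<4.5%" of zero-results queries get results elsewhere
HIT_RATE = 0.35            # "about 35%" of non-English queries get results


def rescued_upper_bound(corpus_size: int, rescued_share_max: float) -> float:
    """Upper bound on the number of queries that gain results on another wiki."""
    return corpus_size * rescued_share_max


def implied_non_english_share(rescued_share_max: float, hit_rate: float) -> float:
    """Share of the corpus that must be non-English for both figures to hold."""
    return rescued_share_max / hit_rate


if __name__ == "__main__":
    bound = rescued_upper_bound(CORPUS_SIZE, RESCUED_SHARE_MAX)
    share = implied_non_english_share(RESCUED_SHARE_MAX, HIT_RATE)
    print(f"at most about {round(bound)} queries gain results")
    print(f"implied non-English share: under {share:.1%}")
```

The implied non-English share comes out a little under 13%, which is consistent with more than 85% of the failed queries being in English or in no particular language.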
More details:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Cross_Language_Wiki_S...
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
wikimedia-search@lists.wikimedia.org