Hi Everyone,
I've done further analysis on the ~1400 zero-results non-DOI query corpus,
looking at the effects of perfect (or at least human-level) language
detection, and the effects of running all queries against many wikis.
In summary:
More than 85% of failed queries to enwiki are in English, or are not in a
particular language. Only about 35% of non-English queries in some language
(<4.5% of zero-results queries), if funneled to the right language wiki,
get any results.
The types of queries most likely to get results from the non-enwikis are
names and queries in English. There are lots of English words in
non-English wikis (enough that they can do decent spelling correction!),
and the idiosyncrasies of language processing on other wikis allow certain
classes of typos in names and English words to match, or the typos happen
to exist uncorrected in the non-enwiki.
Perhaps a better approach to handling non-English queries is user-specified
alternate languages.
More details:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Cross_Language_Wiki_…
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation