I spent today looking at identifying and converting queries typed on the wrong keyboard on the English and Russian Wikipedias.

Highlights

Looking for mis-keyboarded queries in the "right" character set (i.e., Latin on English Wikipedia or Cyrillic on Russian Wikipedia) can explain some gibberish queries and give some improvement in results, but it's very expensive because there are so many candidate queries.

Looking for mis-keyboarded queries in the "wrong" character set (i.e., Cyrillic on English Wikipedia or Latin on Russian Wikipedia) can explain a lot of gibberish queries and give better results, especially on Russian Wikipedia, where possibly more than 1% of queries are accidentally typed on the wrong keyboard!

Limiting the scope to only zero-result queries or perhaps poorly performing (fewer than three results) queries could be computationally less expensive and much more effective!
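To make the "wrong character set, few results" idea concrete, here is a minimal sketch in Python. It is not the code used for this analysis: the function names, the `wiki_script` and `max_hits` parameters, and the assumption that the two layouts involved are standard US QWERTY and Russian ЙЦУКЕН are all mine, for illustration only. It converts a query key for key only when the query is in the unexpected script and returned few results.

```python
# Key-for-key correspondence between the US QWERTY and Russian ЙЦУКЕН layouts
# (assumed layouts; other keyboards would need their own tables).
LATIN_LOWER = "qwertyuiop[]asdfghjkl;'zxcvbnm,.`"
CYRILLIC_LOWER = "йцукенгшщзхъфывапролджэячсмитьбюё"
LATIN_UPPER = "QWERTYUIOP{}ASDFGHJKL:\"ZXCVBNM<>~"
CYRILLIC_UPPER = CYRILLIC_LOWER.upper()

TO_CYRILLIC = str.maketrans(LATIN_LOWER + LATIN_UPPER,
                            CYRILLIC_LOWER + CYRILLIC_UPPER)
TO_LATIN = str.maketrans(CYRILLIC_LOWER + CYRILLIC_UPPER,
                         LATIN_LOWER + LATIN_UPPER)


def dominant_script(query):
    """Return "latin", "cyrillic", or None, based on a majority of the letters."""
    letters = [ch for ch in query if ch.isalpha()]
    if not letters:
        return None
    latin = sum(1 for ch in letters if ch.isascii())
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04ff")
    if latin * 2 > len(letters):
        return "latin"
    if cyrillic * 2 > len(letters):
        return "cyrillic"
    return None


def rekeyboard_candidate(query, wiki_script, hit_count, max_hits=2):
    """Return the query as typed on the other keyboard, or None.

    wiki_script is the script the wiki expects ("latin" or "cyrillic").
    Only queries with at most max_hits results are considered; max_hits=0
    restricts this to zero-result queries, max_hits=2 to "fewer than three".
    """
    if hit_count > max_hits:
        return None  # query already performs well enough; skip it
    script = dominant_script(query)
    if script is None or script == wiki_script:
        return None  # not in the "wrong" character set
    table = TO_LATIN if script == "cyrillic" else TO_CYRILLIC
    return query.translate(table)


print(rekeyboard_candidate("ghbdtn", "cyrillic", 0))
print(rekeyboard_candidate("цшлшзувшф", "latin", 0))
```

The converted string is only a candidate: it would still need to be run as a search and kept only if it actually returns better results than the original.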

More details are available.

—Trey


Trey Jones
Software Engineer, Discovery
Wikimedia Foundation