I spent today looking at identifying and converting queries typed on the wrong keyboard on the English and Russian Wikipedias.
*Highlights*
Looking for mis-keyboarded queries in the "right" character set (i.e., Latin on English Wikipedia or Cyrillic on Russian Wikipedia) can explain some gibberish queries and give some improvement in results, but it's very expensive because there are so many candidate queries.
Looking for mis-keyboarded queries in the "wrong" character set (i.e., Cyrillic on English Wikipedia or Latin on Russian Wikipedia) can explain a lot of gibberish queries and give better results, especially on Russian Wikipedia, where possibly more than 1% of queries are accidentally typed on the wrong keyboard!
Limiting the scope to only zero-result queries or perhaps poorly performing (fewer than three results) queries could be computationally less expensive and much more effective!
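To make the "wrong character set" case concrete, here is a minimal Python sketch, not the code used in the analysis. It assumes a US QWERTY layout and the standard Russian ЙЦУКЕН layout, handles lowercase letters only, and retries only queries that return fewer than three results. The names `wrong_keyboard_candidate` and `rescue` and the `search` callable are hypothetical stand-ins for illustration.

```python
# Assumed key positions: US QWERTY vs. Russian ЙЦУКЕН (lowercase only).
LATIN = "`qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC = "ёйцукенгшщзхъфывапролджэячсмитьбю"

LAT_TO_CYR = str.maketrans(LATIN, CYRILLIC)
CYR_TO_LAT = str.maketrans(CYRILLIC, LATIN)


def script_of(query):
    """Return 'latin', 'cyrillic', 'mixed', or None if there are no letters."""
    letters = [c for c in query.lower() if c.isalpha()]
    if not letters:
        return None
    if all("a" <= c <= "z" for c in letters):
        return "latin"
    if all("а" <= c <= "я" or c == "ё" for c in letters):
        return "cyrillic"
    return "mixed"


def wrong_keyboard_candidate(query, wiki_lang):
    """Re-key a query that is entirely in the 'wrong' script for the wiki."""
    script = script_of(query)
    if wiki_lang == "en" and script == "cyrillic":
        return query.lower().translate(CYR_TO_LAT)
    if wiki_lang == "ru" and script == "latin":
        return query.lower().translate(LAT_TO_CYR)
    return None  # right script or mixed: out of scope for this sketch


def rescue(query, wiki_lang, search, min_results=3):
    """Retry poorly performing queries with the re-keyed candidate.

    `search` is a hypothetical callable that returns a list of results.
    """
    results = search(query)
    if len(results) >= min_results:
        return query, results
    candidate = wrong_keyboard_candidate(query, wiki_lang)
    if candidate is not None:
        alt_results = search(candidate)
        if len(alt_results) > len(results):
            return candidate, alt_results
    return query, results
```

Because the conversion is only attempted when the original query performs poorly, most queries cost a single search, which is the point of limiting the scope.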
More details are available at: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Typing_on_the_Wrong_Keyboard_Russian_and_English
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Hi!
> I spent today looking at identifying and converting queries typed on the wrong keyboard on the English and Russian Wikipedias.
It's a great idea to address that. I do it all the time :) Though I use a different keyboard лаыоут (the "phonetic" [1] one, not the "йцукен" [2] one), that's because I write in English most of the time. People who write mostly in Russian would probably use the traditional Russian one [2].
[1] https://ru.wikipedia.org/wiki/%D0%A4%D0%BE%D0%BD%D0%B5%D1%82%D0%B8%D1%87%D0%... - sorry, no enwiki article yet :)
[2] https://en.wikipedia.org/wiki/JCUKEN
Interestingly enough, when you search for дштгч (linux), Google does search for Linux, but DuckDuckGo not only searches for Linux but also presents the Linux page of Russian Wikipedia as the first result! I think that's pretty smart.
P.S. Both Google and DDG also know how to handle уoutube (with a Cyrillic first letter).
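The уoutube case is a different problem from a whole query typed on the wrong layout: only one letter is in the other script, and it is a look-alike. Here is a rough Python sketch of one way to handle it, assuming a small hand-picked table of Cyrillic letters that resemble Latin ones. The table and the function name `normalize_mixed_script` are illustrative assumptions, not how Google or DuckDuckGo actually do it.

```python
# Assumed look-alike table: Cyrillic letter -> visually similar Latin letter.
HOMOGLYPHS = {
    "\u0430": "a",  # а
    "\u0435": "e",  # е
    "\u043e": "o",  # о
    "\u0440": "p",  # р
    "\u0441": "c",  # с
    "\u0443": "y",  # у
    "\u0445": "x",  # х
}
HOMOGLYPH_TABLE = str.maketrans(HOMOGLYPHS)


def normalize_mixed_script(token):
    """Replace Cyrillic look-alikes in a token that is mostly Latin.

    Tokens that are purely one script, or whose Cyrillic letters are not
    all look-alikes, are returned unchanged.
    """
    lowered = token.lower()
    latin = [c for c in lowered if "a" <= c <= "z"]
    cyrillic = [c for c in lowered if "\u0430" <= c <= "\u044f"]
    if not latin or not cyrillic:
        return token  # single-script token: nothing to fix
    if len(latin) <= len(cyrillic):
        return token  # not predominantly Latin: leave it alone
    if not all(c in HOMOGLYPHS for c in cyrillic):
        return token  # contains Cyrillic letters with no Latin look-alike
    return lowered.translate(HOMOGLYPH_TABLE)
```

With this table, a token such as "\u0443outube" (Cyrillic у followed by Latin "outube") comes back as "youtube", while all-Latin and all-Cyrillic tokens pass through untouched.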