Using language detection to search the right Wikipedia
Wikipedia readers speak many languages, so it’s no surprise that they sometimes search in a language other than that of the wiki they’re currently reading. This, unfortunately, can lead to poor search results. A recent survey we completed on English Wikipedia identified searches done in 40 different languages https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Re-optimization_for_enwiki#Other_languages_searched_on_enwiki [1]!
The Wikimedia Discovery department http://www.mediawiki.org/wiki/Wikimedia_Discovery [2] wants to help people easily find what they are looking for. In order to do this, the Discovery Search team is rolling out new language identification software to the Wikipedia search engine.
This new software will detect when an unsuccessful search appears to be in a language other than that of the wiki being searched. When this happens, the search results page will include results from the Wikipedia of the automatically detected language. These new cross-wiki results will be displayed along with the local-wiki results, if there are any. We’ve recently enabled language identification and cross-wiki search results for the English, French, German, Italian, and Spanish-language Wikipedias.
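For readers who think in code, the flow described above can be sketched roughly as follows. This is only an illustration, not the actual search engine code: the function name, the `search` and `detect_language` callables, and the `min_local_hits` threshold (our stand-in for "unsuccessful") are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


def search_with_language_fallback(
    query: str,
    local_wiki: str,
    search: Callable[[str, str], List[str]],
    detect_language: Callable[[str], Optional[str]],
    min_local_hits: int = 1,
) -> Dict[str, List[str]]:
    """Return results keyed by wiki language code.

    All names here are hypothetical. `search(wiki, query)` returns a list of
    page titles, and `detect_language(query)` returns a language code or None.
    """
    # Always run the search on the wiki the reader is currently using.
    results = {local_wiki: search(local_wiki, query)}

    # Assumption: a search counts as "successful" once it has enough local hits.
    if len(results[local_wiki]) >= min_local_hits:
        return results

    # The search was unsuccessful, so check whether the query looks like
    # another language and, if so, search that language's wiki as well.
    detected = detect_language(query)
    if detected is not None and detected != local_wiki:
        cross_wiki_hits = search(detected, query)
        if cross_wiki_hits:
            results[detected] = cross_wiki_hits

    return results
```

Local results stay in the returned mapping even when they are empty, which mirrors the point above that cross-wiki results are shown along with the local ones rather than replacing them.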
The next group of Wikipedias to have language detection enabled will include Indonesian, Japanese, Portuguese, and Russian http://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_ptwiki_ruwiki_jawiki_and_idwiki [3]. We are investigating ways to bring language detection to more Wikipedias and to other Wikimedia projects.
The Search team has other language detection ideas and plans in the works. We’re thinking about ways to improve language detection with smarter measures of confidence https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_and_Confidence [4]. We are also exploring detection of search in one character set while using a keyboard from another character set. Early experiments with English and Russian https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Typing_on_the_Wrong_Keyboard%E2%80%94Russian_and_English [5] are promising!
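To make the wrong-keyboard idea concrete, here is a minimal sketch of the remapping step only, assuming the standard US QWERTY and Russian ЙЦУКЕН layouts. It is not the approach from the linked experiments; the constant and function names are made up for this example, and deciding whether a query was actually typed on the wrong keyboard is a separate problem not shown here.

```python
# Key positions on a US QWERTY keyboard, row by row, and the characters the
# same physical keys produce on the standard Russian layout.
LATIN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

TO_RUSSIAN = str.maketrans(LATIN_KEYS, CYRILLIC_KEYS)
TO_ENGLISH = str.maketrans(CYRILLIC_KEYS, LATIN_KEYS)


def to_russian_layout(text: str) -> str:
    """Reinterpret Latin keystrokes as if the Russian layout had been active."""
    # Lowercase first to keep the table small; characters outside the table
    # (spaces, digits) pass through unchanged.
    return text.lower().translate(TO_RUSSIAN)


def to_english_layout(text: str) -> str:
    """Reinterpret Cyrillic keystrokes as if the English layout had been active."""
    return text.lower().translate(TO_ENGLISH)


print(to_russian_layout("ghbdtn"))  # prints "привет"
print(to_english_layout("руддщ"))   # prints "hello"
```

A remapped query like this could then be passed to language detection or searched directly, but how to combine those steps is left open here.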
You can find technical details about our new language detection module (TextCat) on MediaWiki.org https://www.mediawiki.org/wiki/TextCat [6]. PHP https://github.com/wikimedia/wikimedia-textcat [7] and updated Perl https://github.com/Trey314159/TextCat [8] libraries are also available, and the libraries include language models for dozens of languages.
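If you are curious what a "language model" means in this context, the toy sketch below shows rank-order comparison of character n-gram profiles, the general technique that TextCat-style identifiers build on. It is not the PHP or Perl library code and does not use their APIs; the function names, the tiny training sentences, and the parameter values are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List


def ngram_profile(text: str, max_n: int = 3, top_k: int = 300) -> List[str]:
    """Return the text's character n-grams (n = 1..max_n), most frequent first."""
    # Mark word boundaries with "_" so word starts and ends become n-grams too.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top_k]


def out_of_place(query_profile: List[str], model_profile: List[str]) -> int:
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else penalty
        for rank, gram in enumerate(query_profile)
    )


def identify(text: str, models: Dict[str, List[str]]) -> str:
    """Pick the language whose model profile is closest to the text's profile."""
    query_profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(query_profile, models[lang]))


# Toy models built from one sentence each; real models use far more text.
models = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "ru": ngram_profile("привет мир это простой пример текста на русском языке"),
}
print(identify("привет мир", models))
```

Short search queries give very small profiles, which is one reason measures of confidence matter for this kind of detection.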
You can also test the language detection using our online demo https://tools.wmflabs.org/textcatdemo/ [9]. The demo lets you try all the different language models on your own text. It also includes tutorials and lots of additional information about TextCat’s internal workings.
Let’s get searching - now with language detection and better results! You can read the blog post https://blog.wikimedia.org/2016/07/27/wikipedia-language-search/ [10] and more detailed information is here https://commons.wikimedia.org/wiki/File:Wikipedia_Seeks_to_Speak_Your_Language.pdf [11].
*Here are some screenshots of what search results looked like before we added language detection... [12]*
*and after we added language detection, for a Russian query on English Wikipedia [13]:*
*Thanks for reading - from the Discovery Search Team Gnomes!*
[1] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Re-optimizati...
[2] http://www.mediawiki.org/wiki/Wikimedia_Discovery
[3] http://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_f...
[4] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_and_Confidenc...
[5] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Typing_on_the_Wrong_K...
[6] https://www.mediawiki.org/wiki/TextCat
[7] https://github.com/wikimedia/wikimedia-textcat
[8] https://github.com/Trey314159/TextCat
[9] https://tools.wmflabs.org/textcatdemo/
[10] https://blog.wikimedia.org/2016/07/27/wikipedia-language-search/
[11] https://commons.wikimedia.org/wiki/File:Wikipedia_Seeks_to_Speak_Your_Langua...
[12] https://commons.wikimedia.org/wiki/File%3AExisting-search_no-textcat.png
[13] https://commons.wikimedia.org/wiki/File%3ANew-search_with-textcat.png
--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation
wikitech-l@lists.wikimedia.org