Using language detection to search the right Wikipedia
Wikipedia readers speak many languages, so it's no surprise that they sometimes search for phrases that are not in the language of the wiki they're currently reading. This, unfortunately, can lead to poor search results. A recent survey we completed on English Wikipedia identified searches done in 40 different languages [1]!
The Wikimedia Discovery department [2] wants to help people easily find what they are looking for. In order to do this, the Discovery Search team is rolling out new language identification software to the Wikipedia search engine.
This new software will detect when a search is unsuccessful, but appears to be in a different language. When this happens, the search results page will include results from the Wikipedia of the automatically detected language. These new cross-wiki results will be displayed along with the local-wiki results, if there are any. We've recently enabled the language identification and search results for the English, French, German, Italian, and Spanish-language Wikipedias.
The next group of Wikipedias to have language detection enabled will include Indonesian, Japanese, Portuguese, and Russian [3]. We are investigating ways to bring language detection to more Wikipedias and to other Wikimedia projects.
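
The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration in Python, not the production search code: the function `search_with_fallback`, the `search` and `detect_language` callables, and the `min_local_results` threshold used to decide that a search was "unsuccessful" are all hypothetical names and assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

# Wikipedias named in the announcement as having the feature enabled.
ENABLED_WIKIS = {"en", "fr", "de", "it", "es"}


def search_with_fallback(
    query: str,
    local_wiki: str,
    search: Callable[[str, str], List[str]],          # hypothetical: (wiki, query) -> titles
    detect_language: Callable[[str], Optional[str]],  # hypothetical: query -> language code or None
    min_local_results: int = 1,                       # assumed cutoff for an "unsuccessful" search
) -> Dict[str, List[str]]:
    """Return results keyed by wiki: local results first, plus cross-wiki results if triggered."""
    results = {local_wiki: search(local_wiki, query)}

    # Only wikis with the feature enabled attempt language detection.
    if local_wiki not in ENABLED_WIKIS:
        return results

    # A successful local search needs no fallback.
    if len(results[local_wiki]) >= min_local_results:
        return results

    detected = detect_language(query)
    # No confident guess, or the guess is the local language: nothing more to do.
    if detected is None or detected == local_wiki:
        return results

    # Cross-wiki results are shown alongside whatever local results exist.
    results[detected] = search(detected, query)
    return results
```

With stub callables, a Russian query on English Wikipedia that finds nothing locally would come back with an empty `"en"` list plus the `"ru"` results, while a query that succeeds locally would return only the `"en"` entry.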
The Search team has other language detection ideas and plans in the works. We're thinking about ways to improve language detection with smarter measures of confidence [4]. We are also exploring detection of searches typed in one character set while using a keyboard from another character set. Early experiments with English and Russian [5] are promising!
You can find technical details about our new language detection module (TextCat) on MediaWiki.org [6]. PHP [7] and updated Perl [8] libraries are also available, and the libraries include language models for dozens of languages.
You can also test the language detection using our online demo [9]. The demo lets you try all the different language models on your own text. It also includes tutorials and lots of additional information about TextCat's internal workings.
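
For readers curious about the general idea, TextCat-style detectors are usually described as n-gram rank-order classifiers: each language model is a ranked list of its most frequent character n-grams, and a query is assigned to the model whose ranking it disagrees with least. The toy sketch below illustrates that technique in Python. It is not the PHP or Perl library code; the function names, the `max_n`, `top_k`, and `penalty` parameters, and the tiny training strings are assumptions made up for the example.

```python
from collections import Counter
from typing import Dict


def ngram_profile(text: str, max_n: int = 3, top_k: int = 300) -> Dict[str, int]:
    """Map each of the top_k most frequent character n-grams (n = 1..max_n) to its rank."""
    # Mark word boundaries with "_" so word-initial and word-final n-grams are distinct.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(query: Dict[str, int], model: Dict[str, int], penalty: int = 300) -> int:
    """Sum rank differences; n-grams missing from the model cost a fixed penalty."""
    return sum(
        abs(rank - model[gram]) if gram in model else penalty
        for gram, rank in query.items()
    )


def detect(text: str, models: Dict[str, Dict[str, int]]) -> str:
    """Return the language whose model has the smallest out-of-place distance."""
    query = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(query, models[lang]))


# Toy models built from a single sentence each; real models use far more text.
MODELS = {
    "en": ngram_profile("hello how are you this is english text"),
    "ru": ngram_profile("привет как дела это русский текст"),
}
```

Calling `detect("привет", MODELS)` picks the Russian model here, since almost none of the query's n-grams appear in the English profile. A real detector would also need some measure of confidence before acting on the guess, which this sketch leaves out.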
Let's get searching - now with language detection and better results! You can read the blog post [10]; more detailed information is available here [11].
Here are some screenshots of what a Russian query on English Wikipedia looked like before we added the language detection [12] and after we added it [13].
Thanks for reading - from the Discovery Search Team Gnomes!
--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation
_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery