Yay! Thank you for this awesome research, Trey. Evaluating language plugins sounds like it would make a /great/ blog post. What alternatives are up next?
On 4 September 2015 at 18:45, Trey Jones <tjones@wikimedia.org> wrote:
I've written up my analysis of the Elasticsearch language detection plugin that Erik recently enabled:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Ev...
The short version is that it really likes Romanian (and Italian, and has a bit of a thing for French). Precision on English is great, but recall is poor (probably because of all the typos and other crap that goes to enwiki but is still technically "English"). Chinese and Arabic are good.
I think we could do better, and we should evaluate (a) other language detectors and (b) the effect of a good language detector on zero results rate (i.e., simulate sending queries to the right place and see how much of a difference it makes).
Moderately pretty pictures included.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search