I've written up my analysis of the Elasticsearch language
detection plugin that Erik recently enabled:
The short version is that it really likes Romanian (and
Italian, and has a bit of a thing for French), so it assigns
those languages far too often. Precision on English is great,
but recall is poor (probably because of all the typos and
other crap sent to enwiki that still technically counts as
"English"). Chinese and Arabic are good.
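
To make the precision/recall terms concrete, here is a minimal Python sketch of how per-language scores can be computed from hand-labeled queries. The function name, the (actual, predicted) pair format, and the sample data are all made up for illustration; they are not taken from the analysis or from the plugin.

```python
from collections import Counter


def per_language_precision_recall(pairs):
    """Compute (precision, recall) per language.

    `pairs` is an iterable of (actual_language, predicted_language)
    tuples. This input format is an assumption made for this sketch.
    """
    true_pos = Counter()   # detector said X and the query really was X
    predicted = Counter()  # how often the detector said X
    actual = Counter()     # how often the query really was X

    for gold, guess in pairs:
        actual[gold] += 1
        predicted[guess] += 1
        if gold == guess:
            true_pos[gold] += 1

    scores = {}
    for lang in set(actual) | set(predicted):
        # None marks a score that is undefined (division by zero).
        precision = true_pos[lang] / predicted[lang] if predicted[lang] else None
        recall = true_pos[lang] / actual[lang] if actual[lang] else None
        scores[lang] = (precision, recall)
    return scores


# Hypothetical toy data, NOT real results: four English queries, three
# of which the detector mislabels as other languages.
sample = [
    ("en", "en"), ("en", "ro"), ("en", "it"), ("en", "fr"),
    ("zh", "zh"), ("ar", "ar"),
]
scores = per_language_precision_recall(sample)
```

In this toy data every query labeled "en" really is English (high precision), but most English queries get labeled as something else (low recall), which is the same shape of problem described above.
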
I think we could do better, and we should evaluate (a)
other language detectors and (b) the effect of a good language
detector on zero results rate (i.e., simulate sending queries
to the right place and see how much of a difference it makes).
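
A rough sketch of what (b) could look like, in Python. Everything here is hypothetical: `detect` and `hit_count` are stand-ins for a language detector and a search backend, the "<lang>wiki" naming is a simplification, and the toy index is invented. It only shows the shape of the simulation, not how it would actually be run.

```python
def zero_results_rate(queries, detect, hit_count, default_wiki="enwiki"):
    """Return (baseline_rate, rerouted_rate) of zero-result queries.

    detect(query) -> language code or None     (hypothetical callable)
    hit_count(wiki, query) -> number of hits   (hypothetical callable)
    """
    if not queries:
        return (0.0, 0.0)

    baseline_zero = 0
    rerouted_zero = 0
    for query in queries:
        if hit_count(default_wiki, query) > 0:
            continue  # already gets results; nothing to simulate
        baseline_zero += 1

        # Simulate sending the query to the wiki for the detected language.
        lang = detect(query)
        target = f"{lang}wiki" if lang else default_wiki
        if target == default_wiki or hit_count(target, query) == 0:
            rerouted_zero += 1  # still no results after rerouting

    total = len(queries)
    return (baseline_zero / total, rerouted_zero / total)


# Invented toy index and detector, for illustration only.
toy_index = {("enwiki", "cat"): 5, ("frwiki", "chat noir"): 3}
toy_langs = {"cat": "en", "chat noir": "fr", "gatto": "it"}

rates = zero_results_rate(
    ["cat", "chat noir", "asdfgh", "gatto"],
    detect=toy_langs.get,
    hit_count=lambda wiki, query: toy_index.get((wiki, query), 0),
)
```

The gap between the two rates is the number to look at: it estimates how much a given detector could lower the zero results rate if queries were routed by its guess.
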
Moderately pretty pictures included.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation