Hey Everyone,
David figured out how the Cybozu ES language detection plugin works in more detail, and figured out how to limit languages and how to retrain the models.
The results are big improvements that bring performance more in line with the results we're seeing from TextCat.
Initial results are below. Each query had a space appended before and after, which improved performance on the old models (I'll verify that's still the case for the new ones).
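For clarity, the padding step is just wrapping each query in single spaces before handing it to the detector. A minimal sketch (the motivation comment is my assumption about why it helps n-gram models):

```python
def pad_query(query: str) -> str:
    # Leading/trailing spaces give the character n-gram models explicit
    # word boundaries at both ends of short query strings.
    return f" {query} "

print(repr(pad_query("wikipedia")))  # ' wikipedia '
```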
Below are the summary stats for all the old language models, the old models limited to "useful" languages, and the new models, retrained on the (admittedly messy) query data used for TextCat training. The evaluation set is the manually tagged enwiki sample.
The full details will be posted on this page shortly.
All languages, old models
f0.5    f1      f2      recall  prec    total  hits  misses
54.4%   47.4%   41.9%   39.0%   60.4%   775    302   198
Limited languages, old models (en,es,zh-cn,zh-tw,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th)
f0.5    f1      f2      recall  prec    total  hits  misses
75.6%   71.0%   67.0%   64.5%   79.0%   775    500   133
Limited languages, retrained models (en,es,zh,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th)
f0.5    f1      f2      recall  prec    total  hits  misses
81.8%   79.2%   76.9%   75.4%   83.5%   775    584   115
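For anyone who wants to sanity-check the tables, the scores follow the standard F-beta formula. A sketch, assuming (as the numbers suggest) precision = hits / (hits + misses) and recall = hits / total:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: weights recall beta times as much as precision."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def summarize(total: int, hits: int, misses: int) -> dict:
    # Assumed definitions: queries with no detection result count against
    # recall (hits / total) but not against precision (hits / (hits + misses)).
    precision = hits / (hits + misses)
    recall = hits / total
    return {
        "f0.5": f_beta(precision, recall, 0.5),
        "f1": f_beta(precision, recall, 1.0),
        "f2": f_beta(precision, recall, 2.0),
        "recall": recall,
        "prec": precision,
    }

# Retrained-models row: total=775, hits=584, misses=115
stats = summarize(775, 584, 115)
print({k: f"{v:.1%}" for k, v in stats.items()})
```

Running this reproduces the last row above (81.8% / 79.2% / 76.9% / 75.4% / 83.5%).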
David suggests that this means we should go with TextCat, since it's easier to integrate, and I agree. However, this test was pretty quick and easy to run, so if we improve the training data, we can easily rebuild the models and test again.
Overall, it's clear that limiting languages to the "useful" ones for a given wiki makes sense, and training on query data rather than generic language data helps, too!
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation