Hey Everyone,
David dug into how the Cybozu ES language detection plugin works in more detail, and worked out how to limit the set of languages it considers and how to retrain its models.
The results are big improvements that bring its performance much more in line with what we're seeing from TextCat.
Initial results are below, for queries with spaces appended before and after (which improved performance on the old models—I'll verify that's still the case).
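In case anyone wants to poke at this outside of Elasticsearch, here's a minimal sketch of what limiting languages looks like with the underlying Cybozu langdetect Java library (which the ES plugin wraps): you load a profile directory that contains only the languages you care about, and pad the query with spaces before detecting. The directory path and query below are made up, and this isn't necessarily exactly how David wired things up on our end.

    import com.cybozu.labs.langdetect.Detector;
    import com.cybozu.labs.langdetect.DetectorFactory;
    import com.cybozu.labs.langdetect.LangDetectException;

    public class LimitedDetectSketch {
        public static void main(String[] args) throws LangDetectException {
            // Load only the profiles for the "useful" languages: copy just those
            // per-language profile files into this directory and leave the rest out,
            // so the detector can never guess a language we don't care about.
            // (Hypothetical path -- point it at your own profile directory.)
            DetectorFactory.loadProfile("/path/to/limited-profiles");

            String query = "birds of north america";

            Detector detector = DetectorFactory.create();
            // Append spaces before and after the query, which helped the old models.
            detector.append(" " + query + " ");

            // detect() returns the most probable language code, e.g. "en".
            System.out.println(detector.detect());
        }
    }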
Below are the summary stats for all of the old language models, for the old models limited to "useful" languages, and for new models retrained on the (admittedly messy) query data used for TextCat training. The evaluation set is the manually tagged enwiki sample.
The full details will be posted on this page shortly:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Ev...
                                 f0.5    f1      f2      recall  prec    total  hits  misses
All languages, old models        54.4%   47.4%   41.9%   39.0%   60.4%   775    302   198
Limited languages, old models*   75.6%   71.0%   67.0%   64.5%   79.0%   775    500   133
Retrained models**               81.8%   79.2%   76.9%   75.4%   83.5%   775    584   115

*  old models limited to en, es, zh-cn, zh-tw, pt, ar, ru, fa, ko, bn, bg, hi, el, ta, th
** retrained on en, es, zh, pt, ar, ru, fa, ko, bn, bg, hi, el, ta, th
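(For reference, the F-scores above are the usual weighted combinations of precision P and recall R,

    F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}

where beta < 1 weights precision more heavily and beta > 1 weights recall; e.g., the retrained models' precision of 83.5% and recall of 75.4% give F1 ≈ 79.2%, matching the table.)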
David suggests that this means we should go with TextCat, since it's easier to integrate, and I agree. However, this test was pretty quick and easy to run, so if we improve the training data, we can easily rebuild the models and test again.
Overall, it's clear that limiting languages to the "useful" ones for a given wiki makes sense, and training on query data rather than generic language data helps, too!
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation