Language Identification Updates - Discovery

19 Jan 2017

Hi Everyone,

What started out as an attempt to derive useful confidence measures for
language identification (with TextCat
<https://www.mediawiki.org/wiki/TextCat>) turned into a generalized
improvement effort. We still don't have useful external confidence
measures—though there's a little work yet to be done there (T149323
<https://phabricator.wikimedia.org/T149323>, T155670
<https://phabricator.wikimedia.org/T155670>). However, I did get a sizable
improvement to the F0.5 <https://en.wikipedia.org/wiki/F1_score> accuracy
scores by improving TextCat internals that don't really generalize to
externally useful measures. The result was a mean improvement of just under
5% across the corpora from nine Wikipedias. The two worst-performing
corpora, enwiki and nlwiki, each went up around 10%! All nine are now above
90% F0.5 score.
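
For anyone who doesn't work with F-scores regularly, here is a minimal sketch of how an F0.5 score is computed from precision and recall. It is the standard F-beta formula with beta = 0.5, which weights precision more heavily than recall. The function name and the sample precision/recall values are made up for illustration; they are not taken from the TextCat evaluation.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta).

    beta < 1 favors precision, beta > 1 favors recall, beta = 1 gives F1.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical numbers, for illustration only.
p, r = 0.9, 0.6
print(f"F0.5 = {f_beta(p, r):.3f}")       # precision-weighted
print(f"F1   = {f_beta(p, r, 1.0):.3f}")  # balanced
```

With precision higher than recall, as in the made-up values above, F0.5 comes out higher than F1, which is the point of using it when wrong identifications cost more than missed ones.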

You can read the final summary and recommendations
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Improvements#Final_Summary_.26_Recommendations>,
or read the rest of the page, too, if you want to know more about the whole
odyssey, or if you have trouble sleeping. ; )

Next steps for language identification are to get these changes deployed,
and then to look at other measures of confidence and/or to extend language
identification to more wikis, though the last two may take a backseat to
working on new and improved language analyzers
<https://phabricator.wikimedia.org/T154511> for the rest of this quarter.

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation