Language Identification Updates - Discovery

19 Jan 2017


Hi Everyone,
What started out as an attempt to derive useful confidence measures for
language identification (with TextCat
https://www.mediawiki.org/wiki/TextCat) turned into a generalized
improvement effort. We still don't have useful external confidence
measures—though there's a little work yet to be done there (T149323
https://phabricator.wikimedia.org/T149323, T155670
https://phabricator.wikimedia.org/T155670). However, I did get a sizable
improvement to the F0.5 https://en.wikipedia.org/wiki/F1_score accuracy
scores by improving TextCat internals that don't really generalize to
externally useful measures. The result was a mean improvement of just under
5% across the corpora from nine Wikipedias. The two worst performing
corpora, enwiki and nlwiki, each went up around 10%! All nine are now above
90% F0.5 score.
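
For anyone unfamiliar with the metric: F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. Here is a minimal sketch of the standard formula in Python. The function names and the counts in the example are hypothetical, chosen only for illustration; they are not TextCat code or numbers from the evaluation.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from raw counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Illustrative counts only, not taken from any of the nine corpora.
    p, r = precision_recall(tp=90, fp=10, fn=60)
    print(f"precision={p:.3f} recall={r:.3f} F0.5={f_beta(p, r):.3f}")
```

With beta = 0.5, a system with high precision and lower recall scores better than one with the two values swapped, which is the reason to prefer F0.5 when a wrong language guess is worse than no guess.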
You can read the final summary and recommendations
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Improvements#Final_Summary_.26_Recommendations,
or read the rest of the page, too, if you want to know more about the whole
odyssey, or if you have trouble sleeping. ; )
Next steps for language identification are to get these changes deployed,
and then to look at other measures of confidence, and/or extend language
identification to more wikis, though the latter two may take a back seat to
working on new and improved language analyzers
https://phabricator.wikimedia.org/T154511 for the rest of this quarter.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation