Hi Everyone,
What started out as an attempt to derive useful confidence measures for language identification (with TextCat https://www.mediawiki.org/wiki/TextCat) turned into a generalized improvement effort. We still don't have useful external confidence measures, though there's a little work yet to be done there (T149323 https://phabricator.wikimedia.org/T149323, T155670 https://phabricator.wikimedia.org/T155670). However, I did get a sizable improvement to the F0.5 https://en.wikipedia.org/wiki/F1_score scores by improving TextCat internals, even though those changes don't really generalize to externally useful measures. The result was a mean improvement of just under 5% across the corpora from nine Wikipedias. The two worst-performing corpora, enwiki and nlwiki, each went up around 10%! All nine now have F0.5 scores above 90%.
You can read the final summary and recommendations https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Improvements#Final_Summary_.26_Recommendations, or read the rest of the page, too, if you want to know more about the whole odyssey, or if you have trouble sleeping. ; )
Next steps for language identification are to get these changes deployed, and then to look at other measures of confidence and/or extend language identification to more wikis. The latter two may take a back seat to working on new and improved language analyzers https://phabricator.wikimedia.org/T154511 for the rest of this quarter.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation