Hi Everyone,

What started out as an attempt to derive useful confidence measures for language identification (with TextCat) turned into a generalized improvement effort. We still don't have useful external confidence measures, though there's a little work yet to be done there (T149323, T155670). However, I did get a sizable improvement in the F0.5 scores by changing TextCat internals, even though those changes don't really generalize to externally useful measures. The result was a mean improvement of just under 5% across the corpora from nine Wikipedias. The two worst-performing corpora, enwiki and nlwiki, each went up around 10%! All nine now have F0.5 scores above 90%.

You can read the final summary and recommendations, or, if you want to know more about the whole odyssey (or if you have trouble sleeping), read the rest of the page too. ; )

Next steps for language identification are to get these changes deployed, and then to look at other measures of confidence and/or to extend language identification to more wikis, though the latter two may take a back seat to working on new and improved language analyzers for the rest of this quarter.

—Trey

Trey Jones

Software Engineer, Discovery
Wikimedia Foundation