What started out as an attempt to derive useful confidence measures for language identification (with TextCat) turned into a generalized improvement effort. We still don't have useful external confidence measures, though there's a little work yet to be done there (T149323, T155670). However, I did get a sizable improvement to the F0.5 accuracy scores by improving TextCat internals that don't really generalize to externally useful measures. The result was a mean improvement of just under 5% across the corpora from nine Wikipedias. The two worst-performing corpora, enwiki and nlwiki, each went up around 10%! All nine are now above 90% F0.5 score.
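
For readers unfamiliar with the metric: F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. Below is a minimal sketch of the standard formula in Python. The function name `f_beta` and the precision/recall values in the usage lines are illustrative assumptions, not TextCat code and not figures from this evaluation.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score.

    With beta < 1 (here 0.5), precision counts for more than recall.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Both precision and recall are zero, so the score is defined as 0.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Made-up inputs, only to show the asymmetry between precision and recall.
high_precision = f_beta(precision=0.9, recall=0.6)
high_recall = f_beta(precision=0.6, recall=0.9)
print(round(high_precision, 3), round(high_recall, 3))
```

Swapping the two inputs changes the score: the high-precision case scores higher, which is the point of choosing beta = 0.5 over the balanced F1.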
Next steps for language identification are to get these changes deployed, and then to look at other measures of confidence and/or extend language identification to more wikis, though the latter two may take a back seat to working on new and improved language analyzers for the rest of this quarter.