New Stuff!
• Stas pointed out that it's surprising that Bulgarian is
on the short list because it's so similar to Russian. Actually,
Bulgarian, Spanish, and Portuguese aren't great (40%-55% F0.5
score), but they weren't obviously causing problems the way
French and Igbo were.
• I should have gone looking for newer implementations of
TextCat, like David did. It is pretty simple code. But that
also means that using and modifying another implementation or
porting our own should be easy. The unknown n-gram penalty fix
was pretty small: use the model size instead of the incoming
sample size as the penalty. (More detail on that in my
write-up.)
• I'm not 100% sure whether using actual queries as training
data helped (I thought it would, which is why I did it),
but the training data is still pretty messy, so depending on
the exact training method, retraining Cybozu could be doable.
The current ES plugin was a black box to me; I didn't even know
it was Cybozu. Anyone know where the code lives, or want to
volunteer to figure out how to retrain it? (Or, additionally,
to turn off the not-so-useful models within it for testing on
enwiki?)
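
For anyone who hasn't looked at TextCat-style code, here is a rough sketch of the unknown n-gram penalty change mentioned above. It is not our actual code or any particular implementation: the function names (ngram_profile, out_of_place_distance, score) and the use_model_size flag are made up for illustration, and the profile building is simplified. It assumes the usual rank-based "out of place" distance, where an n-gram missing from the language model gets a fixed penalty.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=None):
    """Map each character n-gram to its frequency rank (0 = most frequent).

    Hypothetical helper: real implementations differ in tokenization,
    padding, and how many n-grams they keep.
    """
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    if top_k is not None:
        ranked = ranked[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place_distance(sample, model, unknown_penalty):
    """Sum rank differences; n-grams missing from the model cost unknown_penalty."""
    total = 0
    for gram, rank in sample.items():
        if gram in model:
            total += abs(rank - model[gram])
        else:
            total += unknown_penalty
    return total


def score(sample, model, use_model_size=True):
    """Lower is better. The flag switches between the two penalty choices."""
    # Assumption: "model size" means the number of n-grams in the model profile,
    # and "sample size" means the number of n-grams in the incoming profile.
    penalty = len(model) if use_model_size else len(sample)
    return out_of_place_distance(sample, model, penalty)
```

With a short query the sample profile is tiny, so under this sketch a sample-size penalty for an unknown n-gram can be smaller than the rank difference of an n-gram the model does know. Using the model size keeps an unknown n-gram at least as costly as any known one.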