On 08/12/2015 16:22, Trey Jones wrote:
Hey everyone,

I originally started this thread on the internal mailing list, but I should've shared it on the public list. There have been a few replies, so I'll try to summarize everything so far and we can continue discussion here. For those following the old thread, new stuff starts at "New Stuff!" below. Sorry for the mixup.

[snipped because mailman will refuse my mail]
The improvement over the ES plugin baseline (with spaces) is phenomenal. Recall doubled, and precision went up by a third. F0.5 is my preferred measure, but all these are waaaay better:

        f0.5    f1      f2      recall  prec
ES-sp   54.4%   47.4%   41.9%   39.0%   60.4%
ES-thr  69.5%   51.6%   41.1%   36.1%   90.3%
TC-lim  83.3%   83.3%   83.3%   83.4%   83.2%
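For anyone who wants to sanity-check the table, these are the standard F_beta scores (the weighted harmonic mean of precision and recall, where beta < 1 favors precision and beta > 1 favors recall). A quick check against the rows above:

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.
    beta < 1 weights precision more heavily (F0.5); beta > 1 weights recall (F2)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# ES-sp row: precision 60.4%, recall 39.0%
print(round(f_beta(0.604, 0.390, 0.5), 3))  # 0.544, matching the 54.4% F0.5

# TC-lim row: precision 83.2%, recall 83.4%
print(round(f_beta(0.832, 0.834, 1.0), 3))  # 0.833, matching the 83.3% F1
```

The numbers reproduce the table to rounding, which is a good sign the harness is computing what we think it is.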

These numbers are really impressive.


New Stuff!

• Stas pointed out that it's surprising that Bulgarian is on the short list because it's so similar to Russian. Actually, Bulgarian, Spanish, and Portuguese aren't great (40%–55% F0.5), but they weren't obviously causing active problems the way French and Igbo were.

• I should have gone looking for newer implementations of TextCat, like David did. It is pretty simple code, but that also means that using and modifying another implementation, or porting our own, should be easy. The unknown n-gram penalty fix was pretty small: using the model size instead of the incoming sample size as the penalty. (More detail on that in my write-up.)
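To make the penalty fix concrete, here's a minimal sketch of TextCat's out-of-place ranking distance — this is illustrative Python, not the actual Perl code, and the data layout ({ngram: rank} dicts) is my assumption:

```python
def textcat_distance(sample_ranks, model_ranks):
    """TextCat-style out-of-place distance: sum of rank displacements
    between the sample's n-gram ranking and a language model's ranking.
    N-grams absent from the model get a fixed penalty; the fix is to use
    the *model* size for that penalty, not the size of the incoming sample
    (short queries made the old penalty far too cheap)."""
    penalty = len(model_ranks)  # the fix: model size, not len(sample_ranks)
    dist = 0
    for ngram, rank in sample_ranks.items():
        if ngram in model_ranks:
            dist += abs(rank - model_ranks[ngram])
        else:
            dist += penalty
    return dist

# Toy example: ranked n-gram lists represented as {ngram: rank}.
model = {"the": 0, "he ": 1, " th": 2, "ing": 3}
sample = {"the": 0, "ing": 1, "xyz": 2}
print(textcat_distance(sample, model))  # 0 + 2 + 4 = 6
```

The classifier then just picks the language whose model yields the smallest distance for the sample.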

• I'm not 100% sure whether using actual queries as training data helped (I thought it would, which is why I did it), but the training data is still pretty messy, so depending on the exact training method, retraining Cybozu could be doable. The current ES plugin was a black box to me — I didn't even know it was Cybozu. Does anyone know where the code lives, or want to volunteer to figure out how to retrain it? (Or, alternatively, to turn off the not-so-useful models within it for testing on enwiki.)

If it makes sense, I can try to build a profile. The point of this idea was mainly to reuse most of our existing code, but I realize it may require some work to adapt the ES plugin, and it's hard to guess whether it's worth the effort...
Code is here: https://github.com/shuyo/language-detection


• I think it would make sense to set this up as something that can keep the models in memory. I don't know enough about our PHP architecture to know if you can init a plugin and then keep it in memory for the duration. Seems plausible though. A service of some sort (doesn't have to be Perl-based) would also work. We need to think through the architectural bits.

Yes, that was the purpose of my question about init-time overhead. The LM files seem to be already ordered, so it would be 15 TSV files of ~3 KB each to read on each query. If we rewrite it in PHP, we could maybe write the profiles out as a PHP script directly (which should be pretty small compared to the 500 KB of InitialiseSettings.php). But I'm no expert here.
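For a rough sense of that init cost, here's a sketch of loading such a profile — I'm assuming a simple `ngram<TAB>count` TSV layout already sorted most-frequent-first (I haven't checked the exact on-disk format), and the function name is mine:

```python
import io

def load_profile(tsv_text, max_ngrams=400):
    """Parse an LM file — one 'ngram<TAB>count' line per row, assumed
    sorted most-frequent first — into a {ngram: rank} dict. Only the
    rank order matters for the distance computation, so the counts
    themselves are discarded."""
    ranks = {}
    for rank, line in enumerate(io.StringIO(tsv_text)):
        if rank >= max_ngrams:
            break
        ngram = line.rstrip("\n").split("\t")[0]
        ranks[ngram] = rank
    return ranks

# Toy stand-in for one of the ~3 KB TSV files.
profile = load_profile("e\t42\nth\t30\nthe\t25\n")
print(profile)  # {'e': 0, 'th': 1, 'the': 2}
```

Parsing 15 files this small should be cheap even per-request, but generating the profiles once as a PHP array file (as suggested above) would avoid even that, and lets the opcode cache keep them in memory.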