On 08/12/2015 16:22, Trey Jones wrote:
Hey everyone,

I originally started this thread on the internal mailing list, but I should've shared it on the public list. There have been a few replies, so I'll try to summarize everything so far and we can continue discussion here. For those following the old thread, new stuff starts at "New Stuff!" below. Sorry for the mixup.

[snipped because mailman will refuse my mail]
The improvement over the ES plugin baseline (with spaces) is phenomenal. Recall doubled, and precision went up by a third. F0.5 is my preferred measure, but all these are waaaay better:

        f0.5    f1      f2      recall  prec
ES-sp   54.4%   47.4%   41.9%   39.0%   60.4%
ES-thr  69.5%   51.6%   41.1%   36.1%   90.3%
TC-lim  83.3%   83.3%   83.3%   83.4%   83.2%
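For anyone who wants to sanity-check the table, these are the standard F_beta scores (the weighted harmonic mean of precision and recall, where beta < 1 favors precision and beta > 1 favors recall). A quick check against the rows above:

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.
    beta < 1 weights precision more heavily (F0.5); beta > 1 weights recall (F2)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# ES-sp row: precision 60.4%, recall 39.0%
print(round(f_beta(0.604, 0.390, 0.5), 3))  # 0.544, matching the 54.4% F0.5

# TC-lim row: precision 83.2%, recall 83.4%
print(round(f_beta(0.832, 0.834, 1.0), 3))  # 0.833, matching the 83.3% F1
```

The numbers reproduce the table to rounding, which is a good sign the harness is computing what we think it is.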

These numbers are really impressive.


New Stuff!

• Stas pointed out that it's surprising that Bulgarian is on the short list because it's so similar to Russian. Actually, Bulgarian, Spanish, and Portuguese aren't great (40%–55% F0.5), but they weren't obviously causing active problems the way French and Igbo were.

• I should have gone looking for newer implementations of TextCat, like David did. It is pretty simple code, but that also means that using and modifying another implementation, or porting our own, should be easy. The unknown n-gram penalty fix was pretty small: using the model size instead of the incoming sample size as the penalty. (More detail on that in my write-up.)
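To make the penalty fix concrete, here's a minimal sketch of TextCat's out-of-place ranking distance — this is illustrative Python, not the actual Perl code, and the data layout ({ngram: rank} dicts) is my assumption:

```python
def textcat_distance(sample_ranks, model_ranks):
    """TextCat-style out-of-place distance: sum of rank displacements
    between the sample's n-gram ranking and a language model's ranking.
    N-grams absent from the model get a fixed penalty; the fix is to use
    the *model* size for that penalty, not the size of the incoming sample
    (short queries made the old penalty far too cheap)."""
    penalty = len(model_ranks)  # the fix: model size, not len(sample_ranks)
    dist = 0
    for ngram, rank in sample_ranks.items():
        if ngram in model_ranks:
            dist += abs(rank - model_ranks[ngram])
        else:
            dist += penalty
    return dist

# Toy example: ranked n-gram lists represented as {ngram: rank}.
model = {"the": 0, "he ": 1, " th": 2, "ing": 3}
sample = {"the": 0, "ing": 1, "xyz": 2}
print(textcat_distance(sample, model))  # 0 + 2 + 4 = 6
```

The classifier then just picks the language whose model yields the smallest distance for the sample.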

• I'm not 100% sure whether using actual queries as training data helped (I thought it would, which is why I did it), but the training data is still pretty messy, so depending on the exact training method, retraining Cybozu could be doable. The current ES plugin was a black box to me — I didn't even know it was Cybozu. Does anyone know where the code lives, or want to volunteer to figure out how to retrain it? (Or, alternatively, to turn off the not-so-useful models within it for testing on enwiki.)

If it makes sense, I can try to build a profile. The point of this idea was mainly to reuse most of our existing code, but I realize it may require some work to adapt the ES plugin, and it's hard to guess whether it's worth the effort...
Code is here: https://github.com/shuyo/language-detection


• I think it would make sense to set this up as something that can keep the models in memory. I don't know enough about our PHP architecture to know if you can init a plugin and then keep it in memory for the duration. Seems plausible though. A service of some sort (doesn't have to be Perl-based) would also work. We need to think through the architectural bits.

Yes, that was the purpose of my question about init-time overhead. The LM files seem to be already ordered, so it would be 15 TSV files of ~3 KB each to read on each query. If we rewrite it in PHP, we could maybe write the profiles out as a PHP script directly (which should be pretty small compared to the 500 KB of InitialiseSettings.php). But I'm no expert here.
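For a rough sense of that init cost, here's a sketch of loading such a profile — I'm assuming a simple `ngram<TAB>count` TSV layout already sorted most-frequent-first (I haven't checked the exact on-disk format), and the function name is mine:

```python
import io

def load_profile(tsv_text, max_ngrams=400):
    """Parse an LM file — one 'ngram<TAB>count' line per row, assumed
    sorted most-frequent first — into a {ngram: rank} dict. Only the
    rank order matters for the distance computation, so the counts
    themselves are discarded."""
    ranks = {}
    for rank, line in enumerate(io.StringIO(tsv_text)):
        if rank >= max_ngrams:
            break
        ngram = line.rstrip("\n").split("\t")[0]
        ranks[ngram] = rank
    return ranks

# Toy stand-in for one of the ~3 KB TSV files.
profile = load_profile("e\t42\nth\t30\nthe\t25\n")
print(profile)  # {'e': 0, 'th': 1, 'the': 2}
```

Parsing 15 files this small should be cheap even per-request, but generating the profiles once as a PHP array file (as suggested above) would avoid even that, and lets the opcode cache keep them in memory.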