Re: [discovery] TextCat in PHP

22 Dec 2015

Hi!

...
 What’s the overhead for having the n-grams in an external file? That
 makes it easy to update them and keep them under version control. I
 don’t have strong opinions about using a simple tab- or comma-separated
 format vs in PHP array format like you have. Either is easy enough to
 work with (though tab-separated would mean they could be swapped between
 versions easily).
I chose PHP format mainly because it's the fastest way to get data into
PHP. Converting between TSV and PHP is very easy: the lm2php tool converts
one way, and the other direction, if we ever want it, is similarly trivial.
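
As a rough illustration of how small that conversion is, here is a sketch
in Python. It is not the actual lm2php tool: it assumes a model file with
one n-gram per line followed by a tab and a count, and the function names
and the exact layout of the generated PHP are made up for the example.

```python
def tsv_to_php_array(tsv_text):
    """Render 'ngram<TAB>count' lines as a PHP file that returns an array."""
    entries = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the n-gram is everything before the count.
        ngram, count = line.rsplit("\t", 1)
        # Escape backslashes and single quotes for a PHP single-quoted string.
        escaped = ngram.replace("\\", "\\\\").replace("'", "\\'")
        entries.append(f"  '{escaped}' => {int(count)},")
    return "<?php\nreturn array(\n" + "\n".join(entries) + "\n);\n"


def ngrams_to_tsv(ngrams):
    """The reverse direction: dump an {ngram: count} mapping back to TSV."""
    return "".join(f"{ngram}\t{count}\n" for ngram, count in ngrams.items())
```

Either direction is a single pass over the data, so which format sits in
the repo is mostly a question of load speed versus ease of editing.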

...
 I have built additional language models beyond the ones you have put in
 GitHub[1], but I’m not sure about their quality. Some I know are not
 great, others are just hard (confusing Romance languages on short
 strings isn’t surprising), but we don’t yet have general test data for
 evaluating them, so I only shared the ones that aren’t completely
 horrible on enwiki query data.
I think we need to find some way of keeping these models around outside
the TextCat source, where they would be easy to modify if we want to
experiment with them. Not sure yet what the right place for that is.

...
 My other big concern is testing,[2] to make sure that everything works
 as expected compared to the reference implementation in Perl. We should
 build some models - training data is queries, which is treated as PII, so
 it’s on stats1002 but not publicly available. I still need to clean up
 over there and put it in a standard place.[3] We should run some IDs and
I think the textcat repo is a good place for this. I've already started
writing some unit tests there, and we can keep the "standard" models and
maybe some example queries there. Right now the unit tests check that
each model identifies the lines from its original file as their original
language. We may add more tests like that.

...
 P.S.: Since the holidays are coming up pretty quick and not much will
 get done through the end of the year, we should plan some time to hack
 on this while I'm in SF in January.
Surely. I will also start working on bringing this into the Wikimedia
framework: creating the repo on Gerrit, cleaning it up for security
review, testing integration with Cirrus, etc.

-- 
Stas Malyshev
smalyshev(a)wikimedia.org
