Hey Stas,
To follow up on a couple of discussion threads and other ideas (and probably miss one or two):
I’m not sure what the best way to represent the language models is. Internally, TextCat cares about the rank (1st, 2nd, 3rd), not the raw counts. It’s nice to have the actual raw n-gram counts available, because they tell you something about the language model and how reliable it is. (Like, are n-grams 1316 and 1317 in that order because they have different counts, or because they have the same count and were sub-sorted alphabetically?) On the other hand, the counts aren’t strictly necessary.
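To make the rank-vs-count point concrete, here is a rough sketch in Python (just for brevity; nothing here is taken from the Perl or PHP code). It assumes ties are broken alphabetically, which is only a guess at what the model builder does, and the function names are made up for illustration. With the counts kept around you can tell which neighbours in the ranking are really tied.

```python
def rank_ngrams(counts):
    """Return (ngram, count) pairs ordered by count, highest first.

    Assumption: n-grams with equal counts are sub-sorted alphabetically.
    """
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def tied_positions(ranked):
    """Return 0-based positions whose count equals the previous entry's count.

    These are the places where the rank order is an artifact of the
    tie-breaking rule rather than a real difference in frequency.
    """
    return [
        i
        for i in range(1, len(ranked))
        if ranked[i][1] == ranked[i - 1][1]
    ]


# Toy counts, invented for the example.
counts = {"_a": 5, "b_": 3, "ab": 3, "_": 10}
ranked = rank_ngrams(counts)
ranks = {ngram: position + 1 for position, (ngram, _) in enumerate(ranked)}
ties = tied_positions(ranked)
```

If only the ranks were stored, "ab" and "b_" would look like ranks 3 and 4 with no hint that they have the same count.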
What’s the overhead for having the n-grams in an external file? That makes it easy to update them and keep them under version control. I don’t have strong opinions about using a simple tab- or comma-separated format vs. a PHP array format like you have. Either is easy enough to work with (though tab-separated would mean they could be swapped between versions easily).
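For what it's worth, a tab-separated model file is only a few lines of code to read and write in any language. A minimal sketch in Python, assuming one "n-gram, tab, count" pair per line in rank order (that layout and the function names are assumptions for illustration, not a description of the existing model files):

```python
import io


def write_model(ranked, handle):
    """Write (ngram, count) pairs, one per line, as 'ngram<TAB>count'."""
    for ngram, count in ranked:
        handle.write(f"{ngram}\t{count}\n")


def read_model(handle):
    """Read the pairs back, preserving the order they appear in the file."""
    model = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the n-gram itself is left untouched.
        ngram, count = line.rsplit("\t", 1)
        model.append((ngram, int(count)))
    return model


# Round trip through an in-memory file; the data is invented.
ranked = [("_", 10), ("_a", 5), ("ab", 3), ("b_", 3)]
buffer = io.StringIO()
write_model(ranked, buffer)
buffer.seek(0)
loaded = read_model(buffer)
```

The same file diffs cleanly under version control, since each n-gram sits on its own line.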
I have built additional language models beyond the ones you have put in GitHub[1], but I’m not sure about their quality. Some I know are not great, others are just hard (confusing Romance languages on short strings isn’t surprising), but we don’t yet have general test data for evaluating them, so I only shared the ones that aren’t completely horrible on enwiki query data.
My other big concern is testing,[2] to make sure that everything works as expected compared to the reference implementation in Perl. We should build some models. The training data is queries, which are treated as PII, so it’s on stats1002 but not publicly available. I still need to clean up over there and put it in a standard place.[3] We should run some IDs and maybe check internal scores, or at least the ordering of languages returned. (Ties are possible, so that could cause inconsistency in the results, but that should be rare.) When you are ready to test, let me know and I’ll be happy to pitch in.
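One way the ordering check could handle ties, sketched in Python: group languages that share a score and compare the groups, so that two implementations that order tied languages differently still count as agreeing. This assumes each implementation can emit (language, score) pairs and that a lower score is a better match; the function names and the scores below are invented.

```python
def tie_groups(results):
    """Turn (language, score) pairs into a list of sets, best score first.

    Languages with identical scores land in the same set, so their
    relative order does not matter. Assumption: lower score is better.
    """
    groups = []
    last_score = None
    for language, score in sorted(results, key=lambda pair: pair[1]):
        if groups and score == last_score:
            groups[-1].add(language)
        else:
            groups.append({language})
        last_score = score
    return groups


def same_ordering(results_a, results_b):
    """True if both result lists rank the languages the same, up to ties."""
    return tie_groups(results_a) == tie_groups(results_b)


# Invented outputs for one query from two implementations.
perl_out = [("en", 100), ("fr", 120), ("es", 120)]
php_out = [("en", 100), ("es", 120), ("fr", 120)]
php_bad = [("fr", 100), ("en", 120), ("es", 120)]

agrees = same_ordering(perl_out, php_out)
disagrees = same_ordering(perl_out, php_bad)
```

Comparing the internal scores directly would be a stricter test than this; the grouping only covers the "at least the ordering" case.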
So, is there anything else I can help with or try to clarify?
Thanks for taking charge on this! —Trey
P.S.: Since the holidays are coming up pretty quick and not much will get done through the end of the year, we should plan some time to hack on this while I'm in SF in January.
[1] https://github.com/smalyshev/textcat
[2] https://phabricator.wikimedia.org/T121538
[3] At the moment, stats1002: /home/tjones/lang_id_training/input.filtered
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Mon, Dec 14, 2015 at 3:40 PM, Stas Malyshev <smalyshev@wikimedia.org> wrote:
Hi!
Since we've talked about maybe using TextCat-based algorithms, I've made an implementation of textcat as PHP class/utility, which may be useful:
https://github.com/smalyshev/textcat
Please feel free to comment. It is based on what I found at http://odur.let.rug.nl/~vannoord/TextCat/ which is pretty old, so we may want to patch it up, but I think it works as a starting point (provided we want to pursue this route).
I'll work on improving the loading latency (converting LM format to PHP) and making it into a real Composer module. Maybe also add some tests. Improvement suggestions welcome of course.
--
Stas Malyshev
smalyshev@wikimedia.org
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery