Hey Stas,
To follow up on a couple of discussion threads and other ideas (and
probably miss one or two):
I’m not sure what the best way to represent the language models is.
Internally, TextCat cares about the rank (1st, 2nd, 3rd), not the raw
counts. It’s nice to have the actual raw n-gram counts available, because
they tell you something about the language model and how reliable it is.
(Like, are n-gram 1316 and 1317 in that order because they have different
counts, or because they have the same count and were sub-sorted
alphabetically?) On the other hand, it isn’t strictly necessary.
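To make the rank-versus-count question concrete, here is a rough Python sketch (not taken from either the Perl or the PHP code) that ranks n-grams by descending count, sub-sorts ties alphabetically, and flags entries whose position is only an artifact of that tie-break. The function name `rank_ngrams` and the sample counts are made up for illustration.

```python
def rank_ngrams(counts):
    """Rank n-grams by descending count, breaking ties alphabetically.

    `counts` maps n-gram -> raw count. Returns a list of
    (rank, ngram, count, tied) tuples, where `tied` is True when a
    neighbouring entry has the same count, i.e. the relative order
    comes from the alphabetical sub-sort rather than from the data.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = []
    for i, (ngram, count) in enumerate(ordered):
        tied_prev = i > 0 and ordered[i - 1][1] == count
        tied_next = i + 1 < len(ordered) and ordered[i + 1][1] == count
        ranked.append((i + 1, ngram, count, tied_prev or tied_next))
    return ranked


# Hypothetical counts, only to show the shape of the output.
sample = {"th": 10, "he": 10, "e_": 7, "_t": 12}
for rank, ngram, count, tied in rank_ngrams(sample):
    print(rank, ngram, count, "tie" if tied else "")
```

With only the ranks stored, the `tied` column can’t be recovered, which is the main argument for keeping the counts around even if the classifier itself never reads them.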
What’s the overhead for having the n-grams in an external file? That makes
it easy to update them and keep them under version control. I don’t have
strong opinions about using a simple tab- or comma-separated format versus a
PHP array format like the one you have. Either is easy enough to work with
(though tab-separated would mean they could be swapped between versions easily).
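For what it’s worth, a tab-separated model file is about as simple as it gets to read and write. The sketch below is in Python and assumes one `ngram<TAB>count` pair per line, most frequent first, with no tabs inside the n-grams themselves; that layout and the names `write_model` and `read_model` are my assumptions here, not a description of what either implementation does today.

```python
import io


def write_model(counts, out):
    """Write n-gram counts as 'ngram<TAB>count' lines, most frequent first."""
    for ngram, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        out.write(f"{ngram}\t{count}\n")


def read_model(src):
    """Read a tab-separated model; the line order defines the rank.

    Returns (ranks, counts): two dicts keyed by n-gram.
    """
    ranks, counts = {}, {}
    rank = 0
    for line in src:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab, so the count is always the final field.
        ngram, count = line.rsplit("\t", 1)
        rank += 1
        ranks[ngram] = rank
        counts[ngram] = int(count)
    return ranks, counts


# Round trip through an in-memory buffer instead of a real file.
buf = io.StringIO()
write_model({"th": 10, "he": 10, "_t": 12}, buf)
buf.seek(0)
ranks, counts = read_model(buf)
print(ranks)
print(counts)
```

A file like this diffs cleanly under version control, and the rank falls out of the line order, so both the ranks and the raw counts stay available.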
I have built additional language models beyond the ones you have put in
GitHub[1], but I’m not sure about their quality. Some I know are not great,
others are just hard (confusing Romance languages on short strings isn’t
surprising), but we don’t yet have general test data for evaluating them,
so I only shared the ones that aren’t completely horrible on enwiki query
data.
My other big concern is testing,[2] to make sure that everything works as
expected compared to the reference implementation in Perl. We should build
some models. The training data is queries, which are treated as PII, so it’s on
stats1002 but not publicly available. I still need to clean up over there
and put it in a standard place.[3] We should run some language identifications and maybe check
internal scores, or at least the ordering of languages returned. (Ties are
possible, so that could cause inconsistency in the results, but that should
be rare.) When you are ready to test, let me know and I’ll be happy to
pitch in.
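As a starting point for that comparison, here is a small Python sketch of an ordering check that tolerates ties. It assumes each implementation can hand back a list of (language, score) pairs and that a lower score means a better match; the function names and the sample scores are hypothetical.

```python
def tie_groups(results):
    """Collapse (language, score) pairs into an ordered list of sets.

    Languages with identical scores land in the same set, so their
    relative order (which is arbitrary) is ignored. Assumes a lower
    score is a better match.
    """
    groups = []
    last_score = None
    for lang, score in sorted(results, key=lambda r: r[1]):
        if groups and score == last_score:
            groups[-1].add(lang)
        else:
            groups.append({lang})
            last_score = score
    return groups


def same_ordering(reference, candidate):
    """True when both result lists rank the languages identically, up to ties."""
    return tie_groups(reference) == tie_groups(candidate)


# Made-up scores: 'fr' and 'es' tie, and the two lists order them differently.
perl_out = [("en", 100), ("fr", 120), ("es", 120), ("de", 150)]
php_out = [("en", 100), ("es", 120), ("fr", 120), ("de", 150)]
print(same_ordering(perl_out, php_out))
```

This only compares the ordering. If the internal scores are supposed to match exactly, comparing the score lists directly would be a stricter test.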
So, is there anything else I can help with or try to clarify?
Thanks for taking charge on this!
—Trey
P.S.: Since the holidays are coming up pretty quickly and not much will get
done through the end of the year, we should plan some time to hack on this
while I'm in SF in January.
[1] https://github.com/smalyshev/textcat
[2] https://phabricator.wikimedia.org/T121538
[3] At the moment, stats1002: /home/tjones/lang_id_training/input.filtered
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Mon, Dec 14, 2015 at 3:40 PM, Stas Malyshev <smalyshev(a)wikimedia.org>
wrote:
Hi!
Since we've talked about maybe using TextCat-based algorithms, I've made
an implementation of textcat as PHP class/utility, which may be useful:
https://github.com/smalyshev/textcat
Please feel free to comment. It is based on what I found at
http://odur.let.rug.nl/~vannoord/TextCat/ which is pretty old, so we may
want to patch it up, but it works as a starting point I think (provided
we'd want to pursue this route).
I'll work on improving the loading latency (converting LM format to PHP)
and making it into a real composer module. Maybe also add some tests.
Improvement suggestions welcome of course.
--
Stas Malyshev
smalyshev(a)wikimedia.org