Hi!
Since we've talked about maybe using TextCat-based algorithms, I've made an implementation of TextCat as a PHP class/utility, which may be useful:
https://github.com/smalyshev/textcat
Please feel free to comment. It's based on what I found at http://odur.let.rug.nl/~vannoord/TextCat/, which is pretty old, so we may want to patch it up, but I think it works as a starting point (provided we want to pursue this route).
I'll work on improving the loading latency (converting the LM format to PHP) and making it into a real Composer module. Maybe also add some tests. Improvement suggestions welcome, of course.
Hey Stas,
To follow up on a couple of discussion threads and other ideas (and probably miss one or two):
I’m not sure what the best way to represent the language models is. Internally, TextCat cares about the rank (1st, 2nd, 3rd), not the raw counts. It’s nice to have the actual raw n-gram counts available, because they tell you something about the language model and how reliable it is. (For example, are n-grams 1316 and 1317 in that order because they have different counts, or because they have the same count and were sub-sorted alphabetically?) On the other hand, keeping the counts isn’t strictly necessary.
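For anyone following along, the rank-based comparison is the Cavnar & Trenkle "out-of-place" measure. A minimal sketch in PHP (the function name is illustrative, not the actual code from the repo):

    // Out-of-place distance between a document profile and a language
    // model. Both are arrays mapping n-gram => rank (0 = most frequent);
    // only the ordering matters, not the raw counts.
    function outOfPlaceDistance( array $docRanks, array $langRanks ) {
        $maxPenalty = count( $langRanks ); // penalty for unseen n-grams
        $distance = 0;
        foreach ( $docRanks as $ngram => $docRank ) {
            $distance += isset( $langRanks[$ngram] )
                ? abs( $docRank - $langRanks[$ngram] )
                : $maxPenalty;
        }
        return $distance; // lower means a closer match
    }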
What’s the overhead for having the n-grams in an external file? That makes it easy to update them and keep them under version control. I don’t have strong opinions about using a simple tab- or comma-separated format vs in PHP array format like you have. Either is easy enough to work with (though tab-separated would mean they could be swapped between versions easily).
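To make the trade-off concrete, here is a rough sketch of loading a tab-separated model file into the rank map the scoring needs (the format is assumed from the classic .lm layout: one "ngram<TAB>count" line, most frequent first):

    // Read "ngram<TAB>count" lines into an n-gram => rank map.
    // The counts are discarded after loading; only line order matters.
    function loadLmFile( $path ) {
        $ranks = array();
        $rank = 0;
        foreach ( file( $path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES ) as $line ) {
            $parts = explode( "\t", $line );
            $ranks[$parts[0]] = $rank++;
        }
        return $ranks;
    }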
I have built additional language models beyond the ones you have put in GitHub[1], but I’m not sure about their quality. Some I know are not great, others are just hard (confusing Romance languages on short strings isn’t surprising), but we don’t yet have general test data for evaluating them, so I only shared the ones that aren’t completely horrible on enwiki query data.
My other big concern is testing,[2] to make sure that everything works as expected compared to the reference implementation in Perl. We should build some models—training data is queries, which is treated as PII, so it’s on stats1002 but not publicly available. I still need to clean up over there and put it in a standard place.[3] We should run some IDs and maybe check internal scores, or at least the ordering of languages returned. (Ties are possible, so that could cause inconsistency in the results, but that should be rare.) When you are ready to test, let me know and I’ll be happy to pitch in.
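One cheap way to do the Perl comparison: capture the reference implementation's output for a batch of test strings, then replay them through the PHP port and diff the orderings. A sketch (the file name, classify() signature, and result shape are all assumptions):

    // Replay queries through the PHP port and compare the language
    // ordering against output captured from the Perl reference.
    // Assumed input format: "query<TAB>lang1,lang2,..." per line.
    require_once 'TextCat.php';

    $cat = new TextCat( './LM' );
    foreach ( file( 'perl_reference_output.tsv', FILE_IGNORE_NEW_LINES ) as $line ) {
        list( $query, $expected ) = explode( "\t", $line, 2 );
        $got = implode( ',', array_keys( $cat->classify( $query ) ) );
        if ( $got !== $expected ) {
            echo "MISMATCH: $query\n  perl: $expected\n  php:  $got\n";
        }
    }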
So, is there anything else I can help with or try to clarify?
Thanks for taking charge on this! —Trey
P.S.: Since the holidays are coming up pretty quick and not much will get done through the end of the year, we should plan some time to hack on this while I'm in SF in January.
[1] https://github.com/smalyshev/textcat
[2] https://phabricator.wikimedia.org/T121538
[3] At the moment, stats1002: /home/tjones/lang_id_training/input.filtered
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Hi!
> What’s the overhead for having the n-grams in an external file? That makes it easy to update them and keep them under version control. I don’t have strong opinions about using a simple tab- or comma-separated format vs in PHP array format like you have. Either is easy enough to work with (though tab-separated would mean they could be swapped between versions easily).
I chose the PHP format mainly because it's the fastest way to get data into PHP. Converting between TSV and PHP is very easy - the lm2php tool converts one way, and the other direction, if we ever want it, would be similarly trivial.
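Roughly what that conversion looks like (a sketch, not the actual lm2php code; file names are just examples):

    // Sketch of the TSV -> PHP direction: write the model out as a PHP
    // file returning an array, which then loads with a plain include
    // and benefits from the opcode cache.
    $ranks = array();
    $rank = 0;
    foreach ( file( 'english.lm', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES ) as $line ) {
        $parts = explode( "\t", $line );
        $ranks[$parts[0]] = $rank++; // keep the rank; the count isn't needed
    }
    file_put_contents( 'english.lm.php',
        "<?php\nreturn " . var_export( $ranks, true ) . ";\n" );

    // Loading later is a single statement:
    $ranks = include 'english.lm.php';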
> I have built additional language models beyond the ones you have put in GitHub[1], but I’m not sure about their quality. Some I know are not great, others are just hard (confusing Romance languages on short strings isn’t surprising), but we don’t yet have general test data for evaluating them, so I only shared the ones that aren’t completely horrible on enwiki query data.
I think we need some way of keeping these models around outside the TextCat source, where they'd be easily modifiable if we want to experiment with them. Not sure yet what the right place for that is.
> My other big concern is testing,[2] to make sure that everything works as expected compared to the reference implementation in Perl. We should build some models—training data is queries, which is treated as PII, so it’s on stats1002 but not publicly available. I still need to clean up over there and put it in a standard place.[3] We should run some IDs and
I think the textcat repo is a good place for this. I've already started adding some unit tests there, and we can keep the "standard" models and maybe some example queries there too. Right now the unit tests check that each model identifies the lines from its original training file as the original language. We can add more tests like that.
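That round-trip test could look something like this (a PHPUnit sketch; the directory layout and classify() signature are assumptions):

    // Every line of each training file should be identified as the
    // language the file was built from.
    class TextCatRoundTripTest extends PHPUnit_Framework_TestCase {
        public function testTrainingLinesIdentifyAsOwnLanguage() {
            $cat = new TextCat( './LM' );
            foreach ( glob( './data/*.txt' ) as $file ) {
                $lang = basename( $file, '.txt' );
                foreach ( file( $file, FILE_IGNORE_NEW_LINES ) as $line ) {
                    $result = $cat->classify( $line );
                    $this->assertEquals( $lang, key( $result ),
                        "Misidentified line from $file: $line" );
                }
            }
        }
    }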
> P.S.: Since the holidays are coming up pretty quick and not much will get done through the end of the year, we should plan some time to hack on this while I'm in SF in January.
Sure. I will also start working on bringing this into the Wikimedia framework - creating the repo on Gerrit, cleaning it up for security review, testing integration with Cirrus, etc.
This is also on Phab[0] for longer-term documentation, but I'm copying it here for wider distribution.
—Trey
*TextCat and Language Detection*
Back before the holidays (12/23/2015), Stas and Trey had a conversation on IRC about TextCat and Lang ID. There was lots of good stuff in the conversation, so the main points are summarized here, to record for posterity, and to open them up to further conversation if anyone has any additional ideas.
For reference, the main Phab ticket for language ID stuff is T118278: EPIC: Improve Language Identification for use in Cirrus Search[1]
*Building Language Models:* It seems like we should try to create language models covering at least the same set of languages as the original TextCat. The original models were in various encodings, but we’d create (and have created) models in Unicode. In general, we saw better performance doing language detection on queries using models built on queries.[2] If we want to support general language identification, we could also build models based on text from Wikipedia (which we need to do for some languages anyway because the query data is so poor).[3] Building models from Wikipedia text is relatively straightforward, compared to getting sufficiently high-quality query data.[4]
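For readers who haven't seen it, model building follows the classic recipe: count n-grams of length 1-5 over words padded with '_', sort by frequency, and keep the top N. A rough sketch (word splitting and tie-breaking are simplified; the cutoff is a free parameter, with 400 here only as a common default):

    // Build a language model from training text: count n-grams of
    // length 1-5 over words padded with '_', keep the most frequent.
    function buildModel( $text, $topN = 400 ) {
        $counts = array();
        foreach ( preg_split( '/[^\p{L}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY ) as $word ) {
            $word = '_' . $word . '_';
            $len = mb_strlen( $word, 'UTF-8' );
            for ( $n = 1; $n <= 5; $n++ ) {
                for ( $i = 0; $i + $n <= $len; $i++ ) {
                    $ngram = mb_substr( $word, $i, $n, 'UTF-8' );
                    $counts[$ngram] = isset( $counts[$ngram] ) ? $counts[$ngram] + 1 : 1;
                }
            }
        }
        arsort( $counts ); // most frequent first
        return array_slice( $counts, 0, $topN, true );
    }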
*Using Language Models:* We get the biggest improvement in language detection accuracy (~20% increase in F0.5) from restricting the list of candidate languages based on their individual performance and the distribution of languages we encounter in real life, rather than using all available languages.[2][7] We need our new TextCat to support the ability to specify which models to use.[5] It makes sense to create models based on both query data (if we have it) and general text (from Wikipedia) and make them available, probably through Stas’s PHP version of TextCat on GitHub.[6] Trey will also be putting the Perl version and language models up on GitHub after a bit more cleanup.
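In practice, restricting the candidate list might look like this (the classify() signature and best-first result ordering are assumed from the GitHub version):

    // Detect against a hand-picked candidate list instead of every
    // model on disk.
    $cat = new TextCat( './LM' );
    $scores = $cat->classify( 'le chat est sur la chaise', array( 'en', 'fr', 'de', 'es' ) );
    $bestLang = key( $scores ); // e.g. 'fr'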
*Choosing Language Models:* In order to choose which models to use on a particular wiki, we need to sample queries and manually identify the languages represented, and then experimentally determine the best set of language models to use.[8] We will do this for the wikis with the highest query volume, and see how far down the list we have time to go. For any wikis we don’t get to, we can try using a generic set of languages, or just not do language detection for now, or make general capabilities available as an opt-in feature—though we need to think more carefully about how to handle smaller wikis, especially after we have more experience using TextCat on larger wikis.
In addition to evaluation sets for particular wikis, we have a task[9] to create a “balanced” set of queries in known languages for top wikis (by query volume) for general evaluation of language models, which can help us determine a generic set of more-or-less reliable languages. (These are smaller sets that let us gauge general performance, but they aren’t big enough for training language models.)
*Updating Language Model Choices:* Trey’s estimate/intuition (which could use some validation) is that the per-wiki language lists would need updating at most once a quarter, though it’s possible that with appropriate metrics we could detect the need for an update from a sudden or sustained gradual decrease in performance. We may need to think this through a bit more carefully, since different update patterns imply different places/ways to store the list of relevant language models. Stas says that quarterly updates are close enough to static to put language lists into a file in the Cirrus source, pretty much like we do with indexing profiles, etc. Alternatively, if updates are more frequent and per-wiki, we could store the list of languages to use in mediawiki-config.
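If it does end up in mediawiki-config, a per-wiki list could look like the following hypothetical fragment (the setting name and language choices are invented for illustration):

    // Hypothetical per-wiki candidate-language lists, following the
    // usual 'default' + per-dbname override pattern in mediawiki-config.
    'wmgCirrusSearchTextcatLanguages' => array(
        'default' => array( 'en', 'es', 'fr', 'de', 'ru', 'ja' ),
        'enwiki' => array( 'en', 'es', 'zh', 'ar', 'ja' ),
        'frwiki' => array( 'fr', 'en', 'ar', 'es', 'de' ),
    ),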
[0] https://phabricator.wikimedia.org/T118278#1919183
[1] https://phabricator.wikimedia.org/T118278
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat
[3] https://phabricator.wikimedia.org/T121545
[4] https://phabricator.wikimedia.org/T121547, etc. See [1] for more.
[5] https://phabricator.wikimedia.org/T121538
[6] https://github.com/smalyshev/textcat
[7] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Ev...
[8] https://phabricator.wikimedia.org/T121541
[9] https://phabricator.wikimedia.org/T121539
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation