Hi Shubham,

Thanks for the pointer to Naive Bayes!

If you want to know more about TextCat, the paper it is based on is here:

http://odur.let.rug.nl/vannoord/TextCat/textcat.pdf

The basic idea is to sort n-grams by frequency, then compare the rank order of the n-grams against the profile for a given language. For each n-gram, one point is added for every position by which its rank differs between the two lists, and the lowest score wins. (I've updated my write up with this information; thanks for highlighting the oversight.)
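A rough sketch of that rank-comparison idea in Python, in case it helps. This is not the actual TextCat code: the function names, the underscore padding, the default n-gram lengths and profile size, and the way the penalty for n-grams missing from a language profile is passed in are all assumptions made for illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=400):
    """Return the n-grams (length 1..max_n) of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries (an assumption)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]

def out_of_place(sample_profile, language_profile, unknown_penalty):
    """Sum of rank differences; n-grams missing from the language profile
    cost a fixed penalty chosen by the caller."""
    rank_in_language = {g: r for r, g in enumerate(language_profile)}
    score = 0
    for rank, gram in enumerate(sample_profile):
        if gram in rank_in_language:
            score += abs(rank - rank_in_language[gram])
        else:
            score += unknown_penalty
    return score

def identify(text, language_profiles, unknown_penalty):
    """Return the language whose profile gives the lowest score."""
    sample = ngram_profile(text)
    return min(
        language_profiles,
        key=lambda lang: out_of_place(sample, language_profiles[lang], unknown_penalty),
    )
```

A sample is scored against every language profile, and the language with the smallest total wins.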

I'm familiar with Naive Bayes, but haven't looked into using it for language identification. This paper is the top result on Google for naive bayes language identification:

http://www.disi.unige.it/person/MascardiV/Download/ICAART2011c.pdf

After skimming it, it looks like they had to violate a lot of the proper assumptions of Naive Bayes to improve their results—like using the square root of the confidence or the con^(1/log(1+size)) hack, and smoothing for unknown n-grams. Seems like Naive Bayes is too naive, since n-grams are really, really not independent, esp. for short strings. Not that I'm against formula hacking—it's a hobby of mine.

I also like the idea of using priors to influence the outcome (e.g., assuming most queries on enwiki are in English, most queries on frwiki are in French, etc.). That might temper the problem where more English queries were identified as French than French queries were.
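To make the priors idea concrete, here is a minimal sketch assuming a probabilistic classifier (such as Naive Bayes) that yields a log-likelihood per language. The function name, the floor value for languages without a stated prior, and all the numbers in the usage note are hypothetical; nothing here comes from measured data.

```python
import math

def pick_language(log_likelihoods, priors, floor=1e-6):
    """Add each language's log prior to its log-likelihood and return the
    best-scoring language. `floor` (an assumed value) stands in for
    languages that have no stated prior on this wiki."""
    best_lang, best_score = None, -math.inf
    for lang, log_lik in log_likelihoods.items():
        score = log_lik + math.log(priors.get(lang, floor))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

With made-up log-likelihoods of {"en": -10.0, "fr": -9.5}, a prior of {"en": 0.9, "fr": 0.02} picks "en", while a prior of {"en": 0.1, "fr": 0.8} picks "fr". The same ambiguous query gets resolved differently depending on which wiki it came from.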

Right now the goal is to do better than the existing Elastic Search language plugin, with something very lightweight. I'm not sure how much more effort we're going to put into this in the short term—though I love working on it—but I've added the Naive Bayes paper to my ever-growing list of references for language identification.

I'd also love to encourage folks (like you!) to try out other algorithms for language identification. Unfortunately, right now, I can't release the evaluation set I'm using because queries are potentially personally identifiable information ("PII"). With the volume of traffic Wikipedia gets, it's unavoidable that someone accidentally or purposefully searches for their own name, phone number, email, physical address, Social Security Number, or similar (or someone else's!). The WMF tried to release some data in 2012, but had to take it down because of this problem:

http://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/

Since I've manually tagged everything in the evaluation data set (it's fairly small), I'll look into releasing the languagey bits of it (that's what we use for evaluation) and skipping everything that looks like a name, and other non-language categories. That may take a while, though.

Unfortunately, the raw training data can't be released, because it isn't possible to make sure there is no PII in millions of queries. The top 5,000 n-grams in that data are available for several languages, though, as language models in Stas's PHP version of my upgrades to TextCat:

https://github.com/smalyshev/textcat/tree/master/LM

Cheers,

—Trey

Trey Jones

Software Engineer, Discovery
Wikimedia Foundation

On Wed, Dec 16, 2015 at 12:04 AM, Shubham Singh Tomar <tomarshubham24@gmail.com> wrote:

Hi Trey,

Nice write up! Learnt a lot.

I'm not sure what algorithm TextCat uses for language detection. But did you try using Naive Bayes? Naive Bayes works great for text classification problems (No. of features (n-grams) >> No. of classes). Most of the pre-processing steps will remain the same as you documented. And the algorithm is implemented in most of the popular NLP libraries in Python, R, Java, etc.
Let me know if you haven't already tried this. I'll be happy to contribute.

Thanks,
Shubham

On Wed, Dec 16, 2015 at 2:12 AM, Trey Jones <tjones@wikimedia.org> wrote:
Hi everyone,

I've mostly finished up my write up on using TextCat for language detection, here:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat

Highlights include:
- Using query data for training gives better results than using general text (i.e., Wikipedia articles) even though the query data is pretty messy.
- French and English are too similar, esp. since no one uses the diacritics in French. Blame the Normans (for the similarity, at least).
- We need to expand to other wikis (this focused on enwiki).
- We might be able to do better, still, with better training data.

Thanks for the feedback so far!
—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Tue, Dec 8, 2015 at 10:22 AM, Trey Jones <tjones@wikimedia.org> wrote:
Hey everyone,

I originally started this thread on the internal mailing list, but I should've shared it on the public list. There have been a few replies, so I'll try to summarize everything so far and we can continue discussion here. For those following the old thread, new stuff starts at "New Stuff!" below. Sorry for the mixup.

I wanted to share some good news on the language detection front. I've been messing around with TextCat, an n-gram-based language ID tool that's been around for a good while. (The version I've been using is in Really Old Perl, but as David pointed out, it's been ported to other languages (see below) and it would be easy for us to port to Python or PHP.)

I trained fresh models on 59 languages based on queries (not random non-query text, but fairly messy data), and made some improvements to the code (making it Unicode-aware, making it run much faster, and improving the "unknown n-gram" penalty logic) and ran lots of tests with different model sizes. I got decent improvements over baseline TextCat.

After doing some analysis on my English Wikipedia sample, I limited the languages to those that are really good (Thai pretty much only looks like Thai) and really necessary (English, of course, and Spanish). I dropped languages that performed poorly and wouldn't help much anyway (Igbo doesn't come up a lot on enwiki), and dropped some nice-to-have languages that performed too poorly (French, alas). I got down to 15 relevant languages for enwiki: Arabic, Bulgarian, Bengali, Greek, English, Spanish, Farsi, Hindi, Japanese, Korean, Portuguese, Russian, Tamil, Thai, and Chinese.

The improvement over the ES plugin baseline (with spaces) is phenomenal. Recall doubled, and precision went up by a third. F0.5 is my preferred measure, but all these are waaaay better:

         f0.5    f1      f2      recall  prec
ES-sp    54.4%   47.4%   41.9%   39.0%   60.4%
ES-thr   69.5%   51.6%   41.1%   36.1%   90.3%
TC-lim   83.3%   83.3%   83.3%   83.4%   83.2%

• ES-sp is the ES-plugin with spaces added at the beginning and ending of each string.

• ES-thr is the ES-plugin (with spaces) with optimized thresholds for each language (more fine-grained than choosing languages, and overtrained because the optimization data and test data are the same, and brittle because the sample for many languages is very small). Also, this method targets precision, and in theory could improve recall a tiny bit, but in practice took it down a bit.

• TC-lim is TextCat, limiting the languages to the relevant 15 (instead of all 59).
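For anyone who wants to check the F-scores against the recall and precision columns, this is the standard F-beta formula as a small Python function. The function name is my own; the only inputs are the rounded percentages from the table above, so results can differ from the table in the last digit.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score: beta < 1 weights precision more heavily,
    beta > 1 weights recall more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: the ES-sp row, using the rounded table values.
es_sp_f05 = f_beta(0.604, 0.390, 0.5)
```

F0.5 leans toward precision, which fits a setting where a wrong language guess costs more than a missed one.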

I still have some other things to try (equalizing training set sizes, manually selected training sets, trying wiki-text training sets), and a more formal write up is coming, so this is just a preview for now.

Generally, I'm just really glad to get some improved results!

If we want to try an A/B test with this, we'll also need to turn it into something ES can use. Stas noted that it doesn't have to be hosted by ES - since we have a separate call to language detection (which goes to ES now, but can be anything), it can be implemented separately if we wanted to.

David found several other implementations in Java and PHP:

There are many Textcat implementations:
2 JAVA: http://textcat.sourceforge.net/ and http://www.jedi.be/pages/JTextCat/
1 PHP extension: https://pecl.php.net/package/TextCat
and certainly others

I'm wondering why the PHP implementation is done via a C extension... Maybe the initialisation steps (parsing ngrams stat files) are not suited for PHP scripts?

David also pointed out that TextCat uses somewhat the same technique used by Cybozu (ES plugin), so maybe we could try to train it with the same dataset used for TextCat and generate custom profiles. We could reuse the same PHP code in Cirrus. There's a question as to whether it will be possible to implement the new "unknown n-gram" penalty logic. And unfortunately the es-plugin does not support loading multiple profiles, making it hard to run A/B tests. So we will have to write some code anyway...

And David noted that solutions based on scripts will suffer from loading the ngram files for each request, and asked about overhead for the current models.

Erik noted that if we are going to rewrite it in something, a PHP library would make the most sense for reuse in our application (being PHP and all). Otherwise, we could tie the existing Perl together with a Perl based HTTP server and set it up somewhere in the cluster, perhaps the services misc cluster would make sense. The content translation team was also interested in the work we are doing with language detection and may find a microservice more accessible than a PHP library.

As Erik pointed out, the proof is in the pudding, so we'll have to see how this works out when we start sending live user data to it.

New Stuff!

• Stas pointed out that it's surprising that Bulgarian is on the short list because it's so similar to Russian. Actually Bulgarian, Spanish, and Portuguese aren't great (40%-55% F0.5 score) but they weren't obviously actively causing problems, like French and Igbo were.

• I should have gone looking for newer implementations of TextCat, like David did. It is pretty simple code. But that also means that using and modifying another implementation or porting our own should be easy. The unknown n-gram penalty fix was pretty small: using the model size instead of the incoming sample size as the penalty. (More detail on that in my write up.)

• I'm not 100% sure whether using actual queries as training helped (I thought it would, which is why I did it), but the training data is still pretty messy, so depending on the exact training method, retraining Cybozu could be doable. The current ES plugin was a black box to me—I didn't even know it was Cybozu. Anyone know where the code lives, or want to volunteer to figure out how to retrain it? (Or, additionally, turning off the not-so-useful models within it for testing on enwiki.)

• I modified TextCat to load all the models at once and hold them in memory (the previous implementation was noted in the code to be terrible for line-by-line processing because it would load each model individually for each query processed; it was the simplest hack possible to make it run in line-by-line mode). The overhead isn't much: 3.2MB for 59 models with 5000 n-grams. Models seem to be around 70KB, but range from 45KB to 80KB. The 15 languages I used are less than 1MB. Right now, a model with 3000 n-grams seems to be the best, so the overhead would go down by ~40% (not all n-grams are the same size, so it might be more). In short, not too much overhead.

• I think it would make sense to set this up as something that can keep the models in memory. I don't know enough about our PHP architecture to know if you can init a plugin and then keep it in memory for the duration. Seems plausible though. A service of some sort (doesn't have to be Perl-based) would also work. We need to think through the architectural bits.
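To illustrate two of the points above (holding every model in memory, and using the model size as the unknown n-gram penalty), here is a small Python sketch. It is not the modified Perl code: the class name, the input format (a dict of rank-ordered n-gram lists), and the optional language filter are assumptions, and the penalty reflects my reading of "model size" as the number of n-grams in the language model.

```python
class InMemoryDetector:
    """Builds rank lookups for all models once, then reuses them per query."""

    def __init__(self, models, languages=None):
        # models: {language: [ngram, ...]} ordered most frequent first
        # (a hypothetical in-memory format, not the LM file format).
        wanted = languages if languages is not None else list(models)
        self.ranks = {
            lang: {gram: rank for rank, gram in enumerate(models[lang])}
            for lang in wanted
        }

    def score(self, sample_ngrams, lang):
        ranks = self.ranks[lang]
        unknown_penalty = len(ranks)  # model size, not the sample size
        total = 0
        for rank, gram in enumerate(sample_ngrams):
            if gram in ranks:
                total += abs(rank - ranks[gram])
            else:
                total += unknown_penalty
        return total

    def detect(self, sample_ngrams):
        # Lowest score wins.
        return min(self.ranks, key=lambda lang: self.score(sample_ngrams, lang))
```

The constructor does the expensive work once, so a long-running service or plugin would pay the loading cost only at startup. Passing a `languages` list is one way to switch off models that are not useful on a given wiki.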

More questions and comments are welcome! I'm going to spend a few more days trying out other variations on models and training data now that I have an evaluation pipeline set up.

There's also some work involved in using this on other wikis. The best results come from using the detectors that are most useful for the given wiki. For example, there really is no point in detecting Igbo on enwiki because Igbo queries are very rare, and Igbo incorrectly grabs a percentage of English queries, and it's extra overhead to use it. On the Igbo wiki—which made it onto my radar for getting 40K+ queries in a day—an Igbo model would obviously be useful. (A quick check just now also uncovered the biggest problem—many if not most queries to the Igbo wiki are in English. Cleaner training data will definitely be needed for some languages.)

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery

--
Thanks,
Shubham Singh Tomar
Autodidact24.github.io
