I've written up my analysis of the ElasticSearch language detection plugin that Erik recently enabled:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Ev...
The short version is that it really likes Romanian (and Italian, and has a bit of a thing for French), and precision on English is great, but recall is poor (probably because of all the typos and other crap that goes to enwiki but is still technically "English"). Chinese and Arabic are good.
I think we could do better, and we should evaluate (a) other language detectors and (b) the effect of a good language detector on zero results rate (i.e., simulate sending queries to the right place and see how much of a difference it makes).
Moderately pretty pictures included.
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
Yay! Thank you for this awesome research, Trey. Evaluating language plugins sounds like it would make a /great/ blog post. What alternatives are up next?
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
Thanks, Oliver!
I'm not sure what's up next. We could look around for other available detectors, algorithms, or ideas to try. Fortunately we don't need to integrate them to test them—we can just run the queries and evaluate the results.
We could also try something of our own devising, if it turns out to be some combination of easier, better, faster, and good enough.
I'm open to suggestions. Next week I'll ask Dan & Erik about how much effort to put into alternatives.
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
Well, we have the implementation of Kolkus's algorithm in Java, although it's a training-based model, so it'll need a known dataset to train on.
Niklas made a dataset for one of the available language detectors, using some millions of translatewiki.net documents in hundreds of languages: https://github.com/nemobis/LanguageDetector/commit/05040b7ec14b0c261fb6462a1... Cf. http://laxstrom.name/blag/2015/03/09/iwclul-33-conversations-and-ideas/
Nemo
Ooh, excellent! Thanks Nemo!
Thanks!
This is awesome.
Concerning soburdia: the typo is in the first two chars, so our misspelling identification will fail; searching for sucurbia properly displays "suburbia" as a "did you mean" suggestion. This was one of the enhancements we tried to implement, but we are currently blocked by a bug in Elasticsearch. I hope it's not a common pattern, because language detection would add a second error on top of it...
Is it possible to identify how many queries are one word / two words / three words? I'm asking because there's another weakness in this language detector: characters at word boundaries seem to carry valuable information about language features, and the detector can't take advantage of them if it's a one-word query. Running the detector with additional spaces around the query significantly changed the results.
For example, take граничащее (Russian):
Detecting "граничащее" returns bg at 0.99.
But detecting " граничащее " returns ru at 0.57 and bg at 0.42.
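A quick way to see why the padding matters: a character n-gram detector only sees word-boundary n-grams if the boundary characters are present in the input. A minimal sketch in Python (the n-gram extraction here is illustrative, not the plugin's actual implementation):

```python
def char_ngrams(text, n=3, pad=False):
    """Extract character n-grams; with pad=True, surround the text with
    spaces so n-grams covering word boundaries are also produced."""
    if pad:
        text = " " + text + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Without padding, a short one-word query yields almost no n-grams:
print(char_ngrams("cat"))            # ['cat']
# With padding, boundary n-grams (' ca', 'at ') appear as well:
print(char_ngrams("cat", pad=True))  # [' ca', 'cat', 'at ']
```

For a one-word query, those boundary n-grams can be a large fraction of all the evidence the detector gets, which would explain the swing in the граничащее example.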
But in the end I agree with your analysis in "Stupid language detection", mainly because the detector does not weight its results by wiki size (ru should be weighted higher because ruwiki is larger than bgwiki), and that weighting is what we want: we're looking for results; we don't care too much about the actual language of the query.
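A sketch of the kind of reweighting described here: multiply the detector's scores by a prior proportional to wiki size and renormalize. The wiki sizes below are made-up illustrative numbers, not real article counts:

```python
def reweight_by_wiki_size(scores, wiki_sizes):
    """Combine detector confidences with a wiki-size prior, renormalized
    so the adjusted scores sum to 1."""
    weighted = {lang: score * wiki_sizes.get(lang, 1)
                for lang, score in scores.items()}
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}

# The detector slightly prefers bg, but the ru prior flips the decision:
scores = {"bg": 0.6, "ru": 0.4}
wiki_sizes = {"bg": 200_000, "ru": 1_200_000}  # hypothetical sizes
adjusted = reweight_by_wiki_size(scores, wiki_sizes)
best = max(adjusted, key=adjusted.get)  # 'ru'
```

This is effectively a crude Bayesian prior: it trades "correct language" accuracy for a higher chance of sending the query somewhere that actually has results.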
There's a lot to catch up on, but some quick and easy stuff first, in response to David's comments.
For queries that are marked as "language" (775 queries), the distribution of token counts (word counts) up to 10 is below:
  160 queries with  1 token
  152 queries with  2 tokens
  141 queries with  3 tokens
   91 queries with  4 tokens
   63 queries with  5 tokens
   49 queries with  6 tokens
   35 queries with  7 tokens
   18 queries with  8 tokens
   22 queries with  9 tokens
   10 queries with 10 tokens
For more detailed token count info (for all queries, language queries, and non-language queries, including longer queries, max 84 tokens), see [0].
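For reference, a distribution like the one above can be computed with a naive whitespace tokenizer (the actual evaluation may tokenize differently):

```python
from collections import Counter

def token_count_distribution(queries):
    """Map token count -> number of queries, splitting on whitespace."""
    return Counter(len(q.split()) for q in queries)

# e.g. token_count_distribution(["suburbia", "france rugby", "граничащее"])
# -> Counter({1: 2, 2: 1})
```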
I also quickly tested David's discovery that spaces help, and the short version is that it's worth a couple of percentage points in recall and precision, so it's an easy win. More details at [1].
And, just for grins, I scored the current default—assume everything is English—to see how that looks. Recall, precision, and F-Score are much better, but it doesn't help zero results rate (or general relevancy), since these are all queries that failed. So R&P aren't everything. Details at [2].
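To make that trade-off concrete, here is how scoring the always-English baseline might look; the gold labels below are a toy example, not the real query data:

```python
def prf_for_language(gold, predicted, target="en"):
    """Precision, recall, and F-score for one target language."""
    tp = sum(g == target and p == target for g, p in zip(gold, predicted))
    fp = sum(g != target and p == target for g, p in zip(gold, predicted))
    fn = sum(g == target and p != target for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Toy example: "assume English" gets perfect English recall and decent
# precision, yet it routes the non-English queries nowhere useful, so
# the zero results rate doesn't budge.
gold = ["en", "en", "fr", "ru"]
baseline = ["en"] * len(gold)
p, r, f = prf_for_language(gold, baseline)  # (0.5, 1.0, 0.666...)
```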
—Trey
[0] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Ev... [1] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Ev... [2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Ev...
Trey Jones Software Engineer, Discovery Wikimedia Foundation