There's a lot to catch up on, but some quick and easy stuff first, in
response to David's comments.
For queries that are marked as "language" (775 queries), the distribution
of token counts (word counts) up to 10 is below:
1 token:   160 queries
2 tokens:  152 queries
3 tokens:  141 queries
4 tokens:   91 queries
5 tokens:   63 queries
6 tokens:   49 queries
7 tokens:   35 queries
8 tokens:   18 queries
9 tokens:   22 queries
10 tokens:  10 queries
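A distribution like the one above is easy to recompute. This is just a minimal sketch, assuming the queries are available as a list of strings and that simple whitespace tokenization matches the original word counts:

```python
from collections import Counter

def token_count_distribution(queries, max_tokens=10):
    """Count how many queries have each whitespace-token length,
    up to max_tokens."""
    counts = Counter(len(q.split()) for q in queries)
    return {n: counts.get(n, 0) for n in range(1, max_tokens + 1)}

# Toy example input, not the real query set:
sample = ["граничащее", "foo bar", "one two three", "a b"]
distribution = token_count_distribution(sample, max_tokens=3)
```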
For more detailed token count info, covering all queries, language queries,
and non-language queries, including longer queries (max 84 tokens), see [0].
I also quickly tested David's discovery that spaces help, and the short
version is that it's worth a couple of percentage points in recall and
precision, so it's an easy win. More details at [1].
And, just for grins, I scored the current default—assume everything is
English—to see how that looks. Recall, precision, and F-Score are much
better, but it doesn't help zero results rate (or general relevancy), since
these are all queries that failed. So R&P aren't everything. Details at [2].
—Trey
[0]
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_E…
[1]
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_E…
[2]
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_E…
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Mon, Sep 7, 2015 at 5:50 AM, David Causse <dcausse(a)wikimedia.org> wrote:
Thanks! This is awesome.
Concerning soburdia: the typo is in the first 2 chars, so our misspelling
identification will fail; searching for sucurbia properly displays
"suburbia" as a "did you mean" suggestion. This was one of the
enhancements we tried to implement, but we are currently blocked by a bug
in Elasticsearch.
I hope it's not a common pattern, because language detection will add a
second error on top of it...
Is it possible to identify how many queries are 1 word/2 words/3 words?
I'm asking because there's another weakness in this language detector.
Characters at word boundaries seem to carry valuable information about
language features, and the detector can't benefit from them when the query
is a single word. Running the detector with additional trailing spaces
significantly changed the results.
For example, граничащее (Russian):
Detecting "граничащее" returns bg at 0.99
But detecting " граничащее " returns ru at 0.57 and bg at 0.42
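A toy illustration of why the padding helps (a sketch of profile-based character n-gram extraction, not the actual detector's code): surrounding spaces let boundary n-grams like " г" and "е " enter the profile, carrying word-initial and word-final evidence the bare string lacks.

```python
def char_ngrams(text, n=2):
    """Character n-grams, roughly as a profile-based detector extracts them."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Without padding, no n-gram records where the word starts or ends.
plain = set(char_ngrams("граничащее"))
# With surrounding spaces, boundary bigrams (" г", "е ") appear, giving
# the detector extra evidence from word-initial and word-final characters.
padded = set(char_ngrams(" граничащее "))
boundary_evidence = padded - plain  # the bigrams the spaces add
```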
But in the end I agree with your analysis in "Stupid language detection",
mainly because the detector does not weight its results by wiki size (ru
should be weighted higher because ruwiki is larger than bgwiki), and that
is what we are looking for. We're looking for results; we don't care too
much about the actual language of the query.
On 05/09/2015 00:45, Trey Jones wrote:
I've written up my analysis of the ElasticSearch language detection plugin
that Erik recently enabled:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_E…
The short version is that it really likes Romanian (and Italian, and has a
bit of a thing for French), and precision on English is great, but recall
is poor (probably because of all the typos and other crap that goes to
enwiki and is still technically "English"). Chinese and Arabic are good.
I think we could do better, and we should evaluate (a) other language
detectors and (b) the effect of a good language detector on zero results
rate (i.e., simulate sending queries to the right place and see how much of
a difference it makes).
Moderately pretty pictures included.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search