On 29/07/2015 16:53, Trey Jones wrote:
> One issue I’ve had in the back of my mind but haven’t really made
> explicit is the question of exactly how to deal with these oddball
> queries.
> There’s query normalization—converting + and _ to spaces, converting
> curly quotes to straight quotes, and the like—which should do at least
> whatever normalization the indexer does. (Is there documentation on
> what normalization the indexer does do?)
Unfortunately, no, and I'm afraid the analysis chain is too complex to
describe exhaustively.
The easiest way to check is to spin up vagrant and run these
elasticsearch requests:
A simple fulltext query will target the title and redirect data through
the all_near_match field:
curl -XGET
'localhost:9200/wiki_content/_analyze?field=all_near_match&pretty' -d
'article_title article+title'; echo
{
"tokens" : [ {
"token" : "article title article+title",
"start_offset" : 0,
"end_offset" : 27,
"type" : "word",
"position" : 1
} ]
}
A fulltext search will also query the all.plain field (all fields,
analyzed with the standard analyzer):
curl -XGET 'localhost:9200/wiki_content/_analyze?field=all.plain&pretty'
-d 'article_title article+title'; echo
{
"tokens" : [ {
"token" : "article_title",
"start_offset" : 0,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "article",
"start_offset" : 14,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "title",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 3
} ]
}
And finally the all field, which depends on the language (stemming, stopwords):
curl -XGET 'localhost:9200/wiki_content/_analyze?field=all&pretty' -d
'article_title article+title'; echo
{
"tokens" : [ {
"token" : "article",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "title",
"start_offset" : 8,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "article",
"start_offset" : 14,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "title",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
} ]
}
So in this case + and _ should not prevent the query from matching the
doc. Removing them would actually be worse, because it would become
impossible to find words that were indexed with an '_'.
IMHO we should let lucene do its job: we would run into many subtle
bugs if we tried to normalize anything beforehand.
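A toy sketch of the kind of subtle bug such pre-normalization could introduce. The tokenizer below is a crude stand-in for the standard analyzer on all.plain, not the real Lucene chain, and `naive_normalize` is a hypothetical normalizer that converts + and _ to spaces:

```python
import re

def plain_tokens(text):
    # Crude stand-in for the standard analyzer used on all.plain:
    # '_' stays inside a token, while '+' and whitespace split tokens.
    return re.findall(r"[0-9A-Za-z_]+", text.lower())

def naive_normalize(query):
    # Hypothetical pre-normalization step: convert + and _ to spaces.
    return query.replace("+", " ").replace("_", " ")

# Tokens as they would be indexed from 'article_title article+title':
indexed = plain_tokens("article_title article+title")
# ['article_title', 'article', 'title']

# Without normalization the exact token is found; after it, it is lost.
print("article_title" in plain_tokens("article_title"))                   # True
print("article_title" in plain_tokens(naive_normalize("article_title")))  # False
```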
But a query for "article_title" (with quotes) will target only
all.plain, where underscores are kept.
I think the proper fallback method here is to drop the quotes when
there's no match with quotes [0].
[0]
https://www.google.fr/search?q="google+you+don%27t+have+this+page"
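A minimal sketch of that fallback, assuming a hypothetical `search` stub in place of a real CirrusSearch request; the tokenizers are again crude stand-ins for the plain and language analyzers:

```python
import re

# Terms of a document titled "Article title", as the language analyzer
# would see them ('_' and '+' split tokens there).
DOC_TERMS = ["article", "title"]

def search(query):
    # Hypothetical stub for a search request.
    if query.startswith('"') and query.endswith('"'):
        # Phrase queries hit all.plain, where '_' is kept inside tokens,
        # so '"article_title"' does not match ['article', 'title'].
        phrase = re.findall(r"[0-9A-Za-z_]+", query.strip('"').lower())
        return phrase == DOC_TERMS
    # Unquoted queries go through the language analyzer, which splits on '_'.
    terms = re.findall(r"[0-9A-Za-z]+", query.lower())
    return bool(terms) and all(t in DOC_TERMS for t in terms)

def search_with_fallback(query):
    # Fallback: if the quoted query finds nothing, retry without quotes.
    if search(query):
        return True
    if query.startswith('"') and query.endswith('"'):
        return search(query.strip('"'))
    return False

print(search('"article_title"'))                # False: underscore kept
print(search_with_fallback('"article_title"'))  # True: quotes dropped
```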