On 29/07/2015 16:53, Trey Jones wrote:
> One issue I’ve had in the back of my mind but haven’t really made
> explicit is the question of exactly how to deal with these oddball
> queries.
> There’s query normalization—converting + and _ to spaces, converting
> curly quotes to straight quotes, and the like—which should do at least
> whatever normalization the indexer does. (Is there documentation on
> what normalization the indexer does do?)
Unfortunately, no, and I'm afraid the analysis chain is too complex to
describe exhaustively.
The easiest way to check is to spin up vagrant and run these
elasticsearch requests:
A simple fulltext query will target the title and redirect data through
the all_near_match field:
curl -XGET
'localhost:9200/wiki_content/_analyze?field=all_near_match&pretty' -d
'article_title article+title'; echo
{
"tokens" : [ {
"token" : "article title article+title",
"start_offset" : 0,
"end_offset" : 27,
"type" : "word",
"position" : 1
} ]
}
A fulltext search will also query the all.plain field (all fields,
analyzed with the standard analyzer):
curl -XGET 'localhost:9200/wiki_content/_analyze?field=all.plain&pretty'
-d 'article_title article+title'; echo
{
"tokens" : [ {
"token" : "article_title",
"start_offset" : 0,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "article",
"start_offset" : 14,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "title",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 3
} ]
}
And finally the all field, which depends on the language (stemming, stopwords):
curl -XGET 'localhost:9200/wiki_content/_analyze?field=all&pretty' -d
'article_title article+title'; echo
{
"tokens" : [ {
"token" : "article",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "title",
"start_offset" : 8,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "article",
"start_offset" : 14,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "title",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
} ]
}
So in this case + and _ should not prevent the query from matching the
doc. Removing them would actually be worse, because it would become
impossible to find words that were indexed with an '_'.
IMHO we should let lucene do its job: we would run into many subtle
bugs if we tried to normalize anything beforehand.
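A toy sketch of the kind of subtle bug such pre-normalization could introduce. The tokenizer below is a crude stand-in for the standard analyzer on all.plain, not the real Lucene chain, and `naive_normalize` is a hypothetical normalizer that converts + and _ to spaces:

```python
import re

def plain_tokens(text):
    # Crude stand-in for the standard analyzer used on all.plain:
    # '_' stays inside a token, while '+' and whitespace split tokens.
    return re.findall(r"[0-9A-Za-z_]+", text.lower())

def naive_normalize(query):
    # Hypothetical pre-normalization step: convert + and _ to spaces.
    return query.replace("+", " ").replace("_", " ")

# Tokens as they would be indexed from 'article_title article+title':
indexed = plain_tokens("article_title article+title")
# ['article_title', 'article', 'title']

# Without normalization the exact token is found; after it, it is lost.
print("article_title" in plain_tokens("article_title"))                   # True
print("article_title" in plain_tokens(naive_normalize("article_title")))  # False
```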
But a query for "article_title" (with quotes) will target only
all.plain, where underscores are kept.
I think the proper fallback method here is to drop the quotes when
there's no match with quotes [0].
[0]
https://www.google.fr/search?q="google+you+don%27t+have+this+page"
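A minimal sketch of that fallback, assuming a hypothetical `search` stub in place of a real CirrusSearch request; the tokenizers are again crude stand-ins for the plain and language analyzers:

```python
import re

# Terms of a document titled "Article title", as the language analyzer
# would see them ('_' and '+' split tokens there).
DOC_TERMS = ["article", "title"]

def search(query):
    # Hypothetical stub for a search request.
    if query.startswith('"') and query.endswith('"'):
        # Phrase queries hit all.plain, where '_' is kept inside tokens,
        # so '"article_title"' does not match ['article', 'title'].
        phrase = re.findall(r"[0-9A-Za-z_]+", query.strip('"').lower())
        return phrase == DOC_TERMS
    # Unquoted queries go through the language analyzer, which splits on '_'.
    terms = re.findall(r"[0-9A-Za-z]+", query.lower())
    return bool(terms) and all(t in DOC_TERMS for t in terms)

def search_with_fallback(query):
    # Fallback: if the quoted query finds nothing, retry without quotes.
    if search(query):
        return True
    if query.startswith('"') and query.endswith('"'):
        return search(query.strip('"'))
    return False

print(search('"article_title"'))                # False: underscore kept
print(search_with_fallback('"article_title"'))  # True: quotes dropped
```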