On 29/07/2015 16:53, Trey Jones wrote:

One issue I’ve had in the back of my mind but haven’t really made explicit is the question of exactly how to deal with these oddball queries.

There’s query normalization—converting + and _ to spaces, converting curly quotes to straight quotes, and the like—which should do at least whatever normalization the indexer does. (Is there documentation on what normalization the indexer does do?)


Unfortunately no, and I'm afraid the analysis chain is too complex to write an exhaustive description of.
The easiest way to check is to use vagrant and run these Elasticsearch requests:

A simple fulltext query will target the title and redirect data through the all_near_match field:

curl -XGET 'localhost:9200/wiki_content/_analyze?field=all_near_match&pretty' -d 'article_title article+title'; echo
{
  "tokens" : [ {
    "token" : "article title article+title",
    "start_offset" : 0,
    "end_offset" : 27,
    "type" : "word",
    "position" : 1
  } ]
}

A fulltext search will also query the all.plain field (all fields, with a standard analyzer):
curl -XGET 'localhost:9200/wiki_content/_analyze?field=all.plain&pretty' -d 'article_title article+title'; echo

{
  "tokens" : [ {
    "token" : "article_title",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "article",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "title",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

And finally the all field, whose analysis depends on the language (stemming, stopwords):
curl -XGET 'localhost:9200/wiki_content/_analyze?field=all&pretty' -d 'article_title article+title'; echo
{
  "tokens" : [ {
    "token" : "article",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "title",
    "start_offset" : 8,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "article",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "title",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
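
If you want to see the actual analyzer, tokenizer and filter definitions behind these fields, the index itself exposes them; on the vagrant setup (index name as above) something like this should dump the analysis settings and the per-field analyzer assignments:

curl -XGET 'localhost:9200/wiki_content/_settings?pretty'; echo
curl -XGET 'localhost:9200/wiki_content/_mapping?pretty'; echo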

So in this case + and _ should not prevent the query from matching the document. Removing them would actually make things worse, because it would become impossible to find words that were indexed with an '_'.
IMHO we should let Lucene do its job; we would run into many subtle bugs if we tried to normalize anything beforehand.

But the query "article_title" (with quotes) will target only the all.plain and underscore are kept.
I think the proper fallback method here is to drop the quotes when there's no match with quotes[0]
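
To illustrate, here is just a sketch against the fields shown above, not the exact query CirrusSearch builds: the quoted phrase only has a chance on all.plain, and the fallback simply retries the same terms unquoted:

curl -XGET 'localhost:9200/wiki_content/_search?pretty' -d '{
  "query" : { "match_phrase" : { "all.plain" : "article_title" } }
}'; echo

If that returns no hits, retry without the quotes, i.e. as plain terms against the all field:

curl -XGET 'localhost:9200/wiki_content/_search?pretty' -d '{
  "query" : { "match" : { "all" : "article_title" } }
}'; echo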

[0] https://www.google.fr/search?q="google+you+don%27t+have+this+page"