One issue I’ve had in the back of my mind but haven’t really made explicit is the question of exactly how to deal with these oddball queries.

There’s query normalization—converting + and _ to spaces, converting curly quotes to straight quotes, and the like—which should do at least whatever normalization the indexer does. (Is there documentation on what normalization the indexer does do?)
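
A minimal sketch of that kind of light normalization, assuming only the three rewrites named above. The function name and the character table are hypothetical, and whatever the indexer really does should take precedence:

```python
import re

# Hypothetical mapping of curly quotes to straight quotes; the indexer's
# real normalization may cover more (or different) characters.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
})

def normalize_query(query: str) -> str:
    """Apply light, intent-preserving normalization to a raw query."""
    query = query.translate(_QUOTE_MAP)        # straighten quotes
    query = re.sub(r"[+_]", " ", query)        # + and _ become spaces
    return re.sub(r"\s+", " ", query).strip()  # collapse runs of whitespace
```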

Then there are more destructive/transformative techniques, like stripping quot and timestamps, that should perhaps only be used for suggestions (which can be rolled over into re-queries when the original gives zero results).

URL decoding is sort of in between the two, to me. It’s more transformative than straightening quotes, but probably preserves the original intent of the searcher.

And then there are the last ditch attempts to get something vaguely relevant, like converting all non-alphanumerics to spaces—which is a very good last ditch effort, but should only be used if the original and maybe other backoffs fail.
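
The ordering described above (original first, then URL decoding, then stripping, then the last-ditch rewrite) could be sketched as a cascade that only moves to a more destructive rewrite when the previous one returns nothing. Everything here is an assumption for illustration: the regexes for `quot` and timestamps are placeholders, and `search` stands in for whatever function actually runs the query.

```python
import re
from urllib.parse import unquote

# Assumed patterns: the real "quot" and timestamp debris may look different.
QUOT_RE = re.compile(r"&quot;|\bquot\b")
TIMESTAMP_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?\b")

def collapse(q):
    """Collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", q).strip()

def keep_original(q):
    return q

def strip_quot_and_timestamps(q):
    return TIMESTAMP_RE.sub(" ", QUOT_RE.sub(" ", q))

def alnum_only(q):
    # \w is Unicode-aware in Python 3; underscores are removed explicitly.
    return re.sub(r"[^\w\s]|_", " ", q)

# Ordered from least to most destructive.
STEPS = (keep_original, unquote, strip_quot_and_timestamps, alnum_only)

def candidate_queries(query):
    """Yield cumulative rewrites of the query, skipping duplicates."""
    seen = set()
    current = query
    for step in STEPS:
        current = collapse(step(current))
        if current and current not in seen:
            seen.add(current)
            yield current

def search_with_backoff(query, search):
    """Return the first candidate with results; `search` is a hypothetical callable."""
    for candidate in candidate_queries(query):
        results = search(candidate)
        if results:
            return candidate, results
    return query, []
```

Because each step is applied to the output of the previous one, a query only reaches the non-alphanumeric stripping after the gentler rewrites have failed.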

In theory I also like David’s idea of refactoring things so multiple query expansion profiles are possible. Of course multiple searches are expensive. But even if we can’t run too many queries per search, the ability to run different expanded queries for different classes of queries is cool. (e.g., one word query => aggressive expansion; 50 word query => minimal expansion; 500 word query (see below!) => no expansion.)
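
To make the per-class idea concrete, here is a rough sketch that picks a profile name from the token count. The profile names and the cutoffs (2, 10, 100) are invented for illustration; they are only chosen so that the three examples above land where described.

```python
def pick_expansion_profile(query, aggressive_max=2, default_max=10, minimal_max=100):
    """Map a query to a hypothetical expansion profile by whitespace token count."""
    n_tokens = len(query.split())
    if n_tokens <= aggressive_max:
        return "aggressive"
    if n_tokens <= default_max:
        return "default"
    if n_tokens <= minimal_max:
        return "minimal"
    return "none"
```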

In addition to identifying patterns and categories of searches that we can treat differently for analytics, it would make sense to do the same for actual queries. In our quest for non-zero results we shouldn’t favor recall over precision so much that we lose relevance.

One category that turned out to be even crazier than I had anticipated was length. I don’t have the full details at hand at the moment, but out of 500K zero-result queries, there were thousands that were more than 100 characters long, and many that were over a thousand characters long. The longest were over 5000 characters. We should have a heuristic to skip query expansion entirely for queries longer than x characters or z tokens. Doing OR expansion on hundreds of words (they often look like excerpts from books or articles) is a waste of our computational resources.
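
A small sketch of how one might pick those limits from a query log and then apply them. The thresholds passed to `count_longer_than` simply mirror the lengths mentioned above, and `max_chars` and `max_tokens` are left as required parameters because x and z have not been chosen yet; both function names are hypothetical.

```python
def count_longer_than(queries, thresholds=(100, 1000, 5000)):
    """Count how many queries exceed each character-length threshold."""
    lengths = [len(q) for q in queries]
    return {t: sum(1 for n in lengths if n > t) for t in thresholds}

def should_expand(query, max_chars, max_tokens):
    """Return False when the query is too long to be worth expanding."""
    return len(query) <= max_chars and len(query.split()) <= max_tokens
```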

—Trey


Trey Jones
Software Engineer, Discovery
Wikimedia Foundation


On Wed, Jul 29, 2015 at 6:45 AM, David Causse <dcausse@wikimedia.org> wrote:
On 29/07/2015 00:32, Erik Bernhardson wrote:
It seems we will have a number of different options to try. I wonder if it's better to have independent rules or to tie them all together into a more generic rule.

For example:
  Underscore stripping
  Converting + into space (or just urldecoding)

_ and + are already handled by the lucene analysis chain. If the query "article_title" doesn't match, "article title" won't match either:

- third_term[0]
- third+term[1]

Do you have an example where the query "query_with_underscore" returned no results but "query with underscore" (with spaces) returned a result?

  Quote stripping (the bad `quot` ones, but also things that are legitimately quoted but the quoted query has no results)
  Timestamp stripping?

A highly generic rule that would probably get more (but worse) results:
   Either remove or convert into a space everything that's not alphanumeric
   Maybe even join the words with 'OR' instead of 'AND' if there are enough tokens

Reformatting the query at the character level can be quite dangerous because it can conflict with the analysis chain.
Concerning OR and AND I agree, but we have to make sure it won't hurt the scoring. This is the purpose of query expansion[2].
Today we have only one query expansion profile, which allows use of the full syntax offered by cirrus. IMHO the current profile is optimized for precision.
But we could implement different profiles. To illustrate this idea, look at the query word1 word2[3]: today the expansion is an AND query over all.plain with boost 1 and over all with boost 0.5.
  - all.plain contains exact words
  - all contains exact words + stems

Another expansion profile could be :
  - AND over all.plain boost 1
  - AND over all boost 0.5
  - OR over all.plain with boost 0.2
  - OR over all with boost 0.1
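
As a rough illustration only (this is not the query Cirrus emits today, which goes through query_string), the four clauses above could be written as an Elasticsearch `bool` query with `match` clauses. The field names and boosts come from the list; the function name and the choice of `match` are assumptions.

```python
def build_expansion_query(text):
    """Build a hypothetical AND-plus-OR expansion as an Elasticsearch query body."""
    def clause(field, operator, boost):
        return {"match": {field: {"query": text, "operator": operator, "boost": boost}}}

    return {
        "query": {
            "bool": {
                # A document matching any clause is returned; documents matching
                # the stricter AND clauses also collect their higher boosts.
                "should": [
                    clause("all.plain", "and", 1.0),
                    clause("all", "and", 0.5),
                    clause("all.plain", "or", 0.2),
                    clause("all", "or", 0.1),
                ],
                "minimum_should_match": 1,
            }
        }
    }
```

With this shape the OR clauses widen recall, while the low boosts keep documents that match every word ranked first.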

This is oversimplified, but it would be great if we could refactor cirrus in a way that makes it easy to implement different query expansion profiles. We could get rid of query_string for some profiles and use more advanced DSL query clauses (dismax, boosting query, common terms query...).

[0] https://en.wikipedia.org/w/api.php?action=query&format=json&srsearch=third_term&namespace=0&limit=10&list=search
[1] https://en.wikipedia.org/w/api.php?action=query&format=json&srsearch=third%2Bterm&namespace=0&limit=10&list=search
[2] https://en.wikipedia.org/wiki/Query_expansion
[3] https://en.wikipedia.org/w/index.php?search=word1+word2&title=Special%3ASearch&go=Go&cirrusDumpQuery

