One issue I’ve had in the back of my mind but haven’t really made explicit
is the question of exactly how to deal with these oddball queries.
There’s query normalization—converting + and _ to spaces, converting curly
quotes to straight quotes, and the like—which should do at least whatever
normalization the indexer does. (Is there documentation on what
normalization the indexer does do?)
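As a rough sketch of what I mean by the "safe" tier (function name and the exact character set are made up; the real list should mirror the indexer's analysis chain):

```python
import re

# Hypothetical sketch of light-touch query normalization. The real set of
# transformations should match whatever normalization the indexer does.
CURLY_QUOTES = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight
}

def normalize_query(query: str) -> str:
    """Safe normalization that should preserve the searcher's intent."""
    for curly, straight in CURLY_QUOTES.items():
        query = query.replace(curly, straight)
    # Treat _ and + as word separators.
    query = query.replace("_", " ").replace("+", " ")
    # Collapse runs of whitespace left behind.
    return re.sub(r"\s+", " ", query).strip()
```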
Then there are more destructive/transformative techniques, like stripping
stray `quot` artifacts and timestamps, that should perhaps only be used for suggestions
(which can be rolled over into re-queries when the original gives zero
results).
URL decoding is sort of in between the two, to me. It’s more transformative
than straightening quotes, but probably preserves the original intent of
the searcher.
And then there are the last ditch attempts to get something vaguely
relevant, like converting all non-alphanumerics to spaces—which is a very
good last ditch effort, but should only be used if the original and maybe
other backoffs fail.
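The tiers above could be chained as an ordered backoff, trying each rewrite only when the previous ones still return zero results. A minimal sketch (function name, timestamp pattern, and tier ordering are all my own assumptions):

```python
import re
from urllib.parse import unquote

def backoff_rewrites(query: str):
    """Yield progressively more destructive rewrites of a zero-result
    query; each tier is only tried if earlier ones still return nothing."""
    # Tier 1: URL decoding -- transformative, but likely preserves intent.
    decoded = unquote(query)
    if decoded != query:
        yield decoded
    # Tier 2: strip stray 'quot' artifacts and timestamps (suggestion-only).
    stripped = re.sub(r"\bquot\b", " ", query)
    stripped = re.sub(r"\b\d{1,2}:\d{2}(:\d{2})?\b", " ", stripped)
    stripped = re.sub(r"\s+", " ", stripped).strip()
    if stripped and stripped != query:
        yield stripped
    # Tier 3 (last ditch): every non-alphanumeric run becomes a space.
    last_ditch = re.sub(r"[^0-9A-Za-z]+", " ", query).strip()
    if last_ditch and last_ditch != query:
        yield last_ditch
```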
In theory I also like David’s idea of refactoring things so multiple query
expansion profiles are possible. Of course multiple searches are expensive.
But even if we can’t run too many queries per search, the ability to run
different expanded queries for different classes of queries is cool. (e.g.,
one word query => aggressive expansion; 50 word query => minimal expansion;
500 word query (see below!) => no expansion.)
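That kind of class-based dispatch could be as simple as keying the profile off the token count; the thresholds and profile names below are invented for illustration:

```python
def pick_expansion_profile(query: str) -> str:
    """Choose an expansion profile by query size (thresholds are made up)."""
    n_tokens = len(query.split())
    if n_tokens <= 2:
        return "aggressive"   # one- or two-word query: expand heavily
    if n_tokens <= 50:
        return "minimal"      # medium-length query: light expansion only
    return "none"             # very long query: skip expansion entirely
```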
In addition to identifying patterns and categories of searches that we can
treat differently for analytics, it would make sense to do the same for
actual queries. In our quest for non-zero results we shouldn’t favor recall
over precision so much that we lose relevance.
One category that turned out to be even crazier than I had anticipated was length. I
don’t have the full details at hand at the moment, but out of 500K
zero-result queries, there were thousands that were more than 100
characters long, and many that were over a thousand characters long. The
longest were over 5000 characters. We should have a heuristic to not do any
query expansion for queries longer than x characters, or z tokens or
something. Doing OR expansion on hundreds of words—they often look like
excerpts from books or articles—is a waste of our computational resources.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Wed, Jul 29, 2015 at 6:45 AM, David Causse <dcausse(a)wikimedia.org> wrote:
Le 29/07/2015 00:32, Erik Bernhardson a écrit :
It seems we will have a number of different options to try; I wonder if
it's better to have independent rules or to tie them all together into a
more generic rule.
For example:
Underscore stripping
Converting + into space (or just urldecoding)
_ and + are already handled by the lucene analysis chain. If the query
"article_title" doesn't match, then "article title" won't match either:
- third_term[0]
- third+term[1]
Do you have an example where query_with_underscore returned no results and
query with underscore returned results?
Quote stripping (the bad `quot` ones, but also queries that are
legitimately quoted where the quoted phrase returns no
results)
Timestamp stripping?
A highly generic rule that would probably get more (but worse) results:
Either remove or convert to a space everything that's not alphanumeric
Maybe even join the words with 'OR' instead of 'AND' if there are
enough tokens
Re-formatting the query at the character level can be quite dangerous because
it can conflict with the analysis chain.
Concerning OR and AND I agree, but we have to make sure it won't hurt the
scoring. This is the purpose of query expansion[2].
Today we have only one query expansion profile, which permits use of the
full syntax offered by cirrus. IMHO the current profile is optimized for
precision.
But we could implement different profiles. To illustrate this idea, look
at the query word1 word2[3]; today the expansion is an AND query over
all.plain with boost 1 and all with boost 0.5.
- all.plain contains exact words
- all contains exact words + stems
Another expansion profile could be:
- AND over all.plain boost 1
- AND over all boost 0.5
- OR over all.plain with boost 0.2
- OR over all with boost 0.1
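One way to express that profile is as an Elasticsearch bool query over the two fields named above; this is only a sketch of the shape such a profile might take, not actual cirrus code:

```python
# Sketch of the alternative recall-oriented profile as an Elasticsearch
# bool query over all.plain and all; the structure is illustrative only.
recall_profile = {
    "bool": {
        "should": [
            {"match": {"all.plain": {"query": "word1 word2",
                                     "operator": "and", "boost": 1.0}}},
            {"match": {"all":       {"query": "word1 word2",
                                     "operator": "and", "boost": 0.5}}},
            {"match": {"all.plain": {"query": "word1 word2",
                                     "operator": "or",  "boost": 0.2}}},
            {"match": {"all":       {"query": "word1 word2",
                                     "operator": "or",  "boost": 0.1}}},
        ],
        "minimum_should_match": 1,
    }
}
```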
This is oversimplified, but if we could refactor cirrus in a way that
makes it easy to implement different query expansion profiles, it would be
great. We could get rid of query_string for some profiles and use more
advanced DSL query clauses (dismax, boosting query, common terms query...).
[0]
https://en.wikipedia.org/w/api.php?action=query&format=json&srsearc…
[1]
https://en.wikipedia.org/w/api.php?action=query&format=json&srsearc…
[2]
https://en.wikipedia.org/wiki/Query_expansion
[3]
https://en.wikipedia.org/w/index.php?search=word1+word2&title=Special%3…
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search