It seems we will have a number of different options to try. I wonder if it's better to have independent rules or to tie them all together into a more generic rule.

For example:

Underscore stripping

Converting + into space (or just urldecoding)

Quote stripping (the bad `quot` ones, but also queries that are legitimately quoted where the quoted form has no results)

Timestamp stripping?
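
As a rough sketch of what the independent-rules option could look like (Python purely for illustration; all function names here are made up, and I'm assuming the bad `quot` queries are leftover `&quot;` entities, which may not be the whole story). Timestamp stripping is left out because we haven't pinned down what those timestamps look like.

```python
import re
from urllib.parse import unquote_plus


def strip_underscores(query: str) -> str:
    """Treat underscores as word separators, as in article_title_with_underscore."""
    return " ".join(query.replace("_", " ").split())


def decode_plus(query: str) -> str:
    """URL-decode the query; unquote_plus also turns '+' into a space."""
    return unquote_plus(query)


def strip_quotes(query: str) -> str:
    """Drop double quotes and literal &quot; entities (assumed form of the bad quot)."""
    return " ".join(re.sub(r'&quot;|"', " ", query).split())


# Hypothetical registry: each rule is independent and can be tried on its own.
RULES = [strip_underscores, decode_plus, strip_quotes]


def candidate_rewrites(query: str) -> list[tuple[str, str]]:
    """Return (rule name, rewritten query) for every rule that changes the query."""
    candidates = []
    for rule in RULES:
        rewritten = rule(query)
        if rewritten != query:
            candidates.append((rule.__name__, rewritten))
    return candidates
```

With this shape, a query like `"foo_bar"` yields two separate candidates (one with the underscore turned into a space, one with the quotes removed), and we could measure each rule on its own before deciding whether to combine them.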

A highly generic rule that would probably get more (but worse) results:

Either remove, or convert into a space, everything that's not alphanumeric

Maybe even join the words with 'OR' instead of 'AND' if there are enough tokens
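
A minimal sketch of that generic rule, again in Python just to make it concrete. The name `generic_rewrite` and the `or_threshold` default of 4 are placeholders I picked, not anything we've agreed on, and it takes the "convert into a space" variant rather than outright removal.

```python
import re


def generic_rewrite(query: str, or_threshold: int = 4) -> str:
    """Keep only alphanumeric runs; join with OR once there are enough tokens.

    or_threshold is a hypothetical cutoff: below it the tokens are joined with
    a space (the implicit AND), at or above it they are joined with ' OR '.
    """
    # [^\W_]+ matches runs of letters and digits (Unicode-aware), so
    # underscores, '+', quotes and other punctuation all act as separators.
    tokens = re.findall(r"[^\W_]+", query)
    joiner = " OR " if len(tokens) >= or_threshold else " "
    return joiner.join(tokens)
```

Note that this treats a literal `AND` in the incoming query as just another token, so the boolean queries from the log analysis would need separate handling if we want to preserve or drop the operator deliberately.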

If we go the route of attempting to rewrite the query into something more plausible, is that something we would be building into Elasticsearch, or CirrusSearch? I could come up with plausible reasons for it being on either side, but I am leaning towards some sort of custom suggester implementation that does our own thing (although that may be due to my not knowing the internal API limitations there).

On Tue, Jul 28, 2015 at 3:24 PM, Chad Horohoe <chorohoe@wikimedia.org> wrote:

On Tue, Jul 28, 2015 at 12:57 PM, Trey Jones <tjones@wikimedia.org> wrote:
The boolean AND queries are largely in enwiki (17607: ~3.5% overall, ~7.9% in enwiki), and they are a mixed bag, but many (626) appear with quot, and most (16657) are of the form
"article_title_with_underscore" AND "article title without underscores"
where the first half is repeated over and over and the second half is something linked to in the first article. Find the source and add to the automata list.

We can probably do better on the underscores thing. Nik
even said as much back in November[0].

-Chad

[0] https://phabricator.wikimedia.org/T64059
