It seems we will have a number of different options to try. I wonder if it's
better to have independent rules or to tie them all together into a more
generic rule.
For example:
 * Underscore stripping
 * Converting + into space (or just urldecoding)
 * Quote stripping (the bad `quot` ones, but also things that are
   legitimately quoted but the quoted query has no results)
 * Timestamp stripping?
A highly generic rule that would probably get more (but worse) results:
 * Either remove or convert into a space everything that's not alphanumeric
 * Maybe even join the words with 'OR' instead of 'AND' if there are
   enough tokens
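
To make the comparison concrete, here is a rough sketch of what the independent rules and the generic rule might look like. Everything in it is an assumption for illustration: the function names, the guess that the bad `quot` residue shows up as `&quot;` or as a standalone word, and the threshold of 4 tokens for switching to OR. Timestamp stripping is left out because I don't know what those queries look like.

```python
import re
from urllib.parse import unquote_plus


def _collapse(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def strip_underscores(query: str) -> str:
    """Independent rule: treat underscores as spaces."""
    return _collapse(query.replace("_", " "))


def decode_plus(query: str) -> str:
    """Independent rule: urldecode, which also turns '+' into a space."""
    return _collapse(unquote_plus(query))


def strip_quotes(query: str) -> str:
    """Independent rule: drop quote characters and `quot` residue.

    Assumption: the residue appears as '&quot;' or as a standalone
    word 'quot'; the real data may look different.
    """
    query = re.sub(r"&quot;|\bquot\b", " ", query)
    return _collapse(query.replace('"', " "))


def generic_rewrite(query: str, or_threshold: int = 4) -> str:
    """Generic rule: keep only alphanumeric tokens.

    Joins with ' OR ' once there are at least `or_threshold` tokens
    (a hypothetical cutoff), otherwise with a plain space.
    """
    tokens = re.findall(r"[^\W_]+", unquote_plus(query))
    joiner = " OR " if len(tokens) >= or_threshold else " "
    return joiner.join(tokens)
```

The independent rules each change one thing and are easy to measure separately; the generic one subsumes most of them but throws away more of the original query.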
If we go the route of attempting to rewrite the query into something more
plausible, is that something we would be building into Elasticsearch or
CirrusSearch? I could come up with plausible reasons for it being on either
side, but I am leaning towards some sort of custom suggester implementation
that does our own thing (although that may be due to my not knowing the
internal API limitations there).
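
Wherever it ends up living, the shape I have in mind is roughly "try rewrites in order and keep the first one that returns something". A minimal sketch, assuming a hypothetical `count_hits` callable standing in for whatever actually runs the search; none of these names correspond to real Elasticsearch or CirrusSearch APIs.

```python
from typing import Callable, Iterable, Optional, Tuple

Rule = Callable[[str], str]


def first_productive_rewrite(
    query: str,
    rules: Iterable[Tuple[str, Rule]],
    count_hits: Callable[[str], int],
) -> Optional[Tuple[str, str]]:
    """Return (rule_name, rewritten_query) for the first rule whose
    output differs from anything already tried and yields hits.

    `count_hits` is a hypothetical stand-in for the real search call.
    Returns None if no rule produces a query with results.
    """
    seen = {query}
    for name, rule in rules:
        candidate = rule(query)
        if candidate in seen:
            # The rule changed nothing new, so skip the extra search.
            continue
        seen.add(candidate)
        if count_hits(candidate) > 0:
            return name, candidate
    return None
```

Ordering the rules from least to most destructive would mean the generic rewrite only kicks in when the gentler ones fail.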
On Tue, Jul 28, 2015 at 3:24 PM, Chad Horohoe <chorohoe(a)wikimedia.org>
wrote:
On Tue, Jul 28, 2015 at 12:57 PM, Trey Jones <tjones(a)wikimedia.org> wrote:
The boolean AND queries are largely in enwiki (17607: ~3.5% overall,
~7.9% in enwiki), and they are a mixed bag, but many (626) appear with
quot, and most (16657) are of the form
"article_title_with_underscore" AND "article title without underscores"
where the first half is repeated over and over and the second half is
something linked to in the first article. Find the source and add to the
automata list.
We can probably do better on the underscores thing. Nik
even said as much back in November[0].
-Chad
[0] https://phabricator.wikimedia.org/T64059
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search