Yeah, it looks like Common Terms is a low-effort, high-value way of dealing
with this issue. Of course ES is going to have some clever way of dealing
with stop words.
Here's a more readable blog post about Common Terms:
https://www.elastic.co/blog/stop-stopping-stop-words-a-look-at-common-terms…
And, for reference, ES has stop word lists for >30 languages:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-st…
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Fri, Aug 28, 2015 at 1:34 AM, David Causse <dcausse(a)wikimedia.org> wrote:
On 27/08/2015 22:29, Trey Jones wrote:
Anyway, I like stripping stop words better than relaxing AND to OR,
unless there's some additional post-search ranking to sort the results into
a more AND-ish order.
I think my previous mail was misleading; I don't want to replace AND with
OR. I mean that when the query contains a lot of words (e.g. questions), the
default AND is not appropriate because a single missing stopword could hide a
good result. We could use the minimum_should_match attribute, which lets us
require a minimum number of terms to match (e.g. 90% of the query terms
should match).
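As a rough illustration of the minimum_should_match idea, here is a sketch of a request body for a match query, written as a Python dict. The field name "text" and the helper name build_match_query are made up for the example, and switching the operator to OR is an assumption about how this could be wired up, not a description of the current setup. The 90% figure is the one mentioned above.

```python
import json


def build_match_query(user_query, field="text", min_match="90%"):
    """Build a match query body that requires only part of the terms.

    `field` is a hypothetical field name; adjust it to the real mapping.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": user_query,
                    # Every term becomes an optional clause...
                    "operator": "or",
                    # ...but most of them must still match.
                    "minimum_should_match": min_match,
                }
            }
        }
    }


body = build_match_query(
    "What's the connection between power laws and zipf distribution"
)
print(json.dumps(body, indent=2))
```

With this shape, a long question can still match a document that lacks one of its terms, while short queries keep behaving much like AND.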
There's also another interesting query that does the "stopword stripping"
automagically: the common terms query [1].
In a few words, this query is able to detect stopwords by analyzing word
frequencies at query time, so the query:
What's the connection between power laws and zipf distribution
will be split into 2 clauses:
- connection power laws zipf distribution
- what's the between and
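The split above can be imitated in a few lines of plain Python. This is only a toy model: the document frequencies, the index size and the cutoff value are invented for the example, and the real query works from index statistics rather than from a hand-written table.

```python
def split_by_frequency(terms, doc_freq, total_docs, cutoff=0.1):
    """Split terms into (low_freq, high_freq) lists.

    A term is treated as high frequency when the share of documents
    containing it is above `cutoff`.
    """
    low, high = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / total_docs
        if ratio > cutoff:
            high.append(term)
        else:
            low.append(term)
    return low, high


# Made-up document frequencies for a 1000-document index.
doc_freq = {
    "what's": 700, "the": 990, "between": 400, "and": 980,
    "connection": 30, "power": 50, "laws": 40, "zipf": 2,
    "distribution": 45,
}
terms = "what's the connection between power laws and zipf distribution".split()

low, high = split_by_frequency(terms, doc_freq, total_docs=1000)
print("low freq: ", " ".join(low))
print("high freq:", " ".join(high))
```

The point is that no stop word list is needed: the two groups fall out of the frequencies alone.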
And we can control the boolean operator of these clauses independently,
e.g. OR for high-frequency words and AND for low-frequency words. Or even
more complex stuff like "3<80%" [2]: if there are more than 3 words, only
80% of them are required.
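A sketch of what such a request could look like, again as a Python dict. The field name "text", the cutoff value and both helper names are assumptions made for the example. The second helper is not part of Elasticsearch; it only spells out how a spec like "3<80%" is read, assuming the percentage is rounded down.

```python
def build_common_terms_query(user_query, field="text"):
    """Common terms query body with one operator per frequency group.

    `field` and the cutoff value are hypothetical; tune them per index.
    """
    return {
        "query": {
            "common": {
                field: {
                    "query": user_query,
                    "cutoff_frequency": 0.001,
                    # Rare words carry the meaning: require all of them.
                    "low_freq_operator": "and",
                    # Frequent words only help with scoring.
                    "high_freq_operator": "or",
                }
            }
        }
    }


def required_matches(num_terms, threshold=3, percent=80):
    """Number of terms that must match under a "threshold<percent%" spec."""
    if num_terms <= threshold:
        # Up to `threshold` terms: all of them are required.
        return num_terms
    # Above the threshold: only `percent` of them, rounded down.
    return num_terms * percent // 100


body = build_common_terms_query(
    "What's the connection between power laws and zipf distribution"
)
for n in (2, 3, 5, 10):
    print(n, "terms ->", required_matches(n), "required")
```

Passing "3<80%" as minimum_should_match instead of a fixed operator would apply the second rule to a group of terms.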
[1]
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-c…
[2]
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-m…
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search