Yeah, it looks like Common Terms is a low-effort, high-value way of dealing with this issue. Of course ES is going to have some clever way of dealing with stop words.
Here's a more readable blog post about Common Terms: https://www.elastic.co/blog/stop-stopping-stop-words-a-look-at-common-terms-...
And, for reference, ES has stop word lists for >30 languages: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-sto...
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Fri, Aug 28, 2015 at 1:34 AM, David Causse dcausse@wikimedia.org wrote:
On 27/08/2015 22:29, Trey Jones wrote:
Anyway, I like stripping stop words better than relaxing AND to OR, unless there's some additional post-search ranking to sort the results into a more AND-ish order.
I think my previous mail was misleading: I don't want to replace AND with OR. I mean that when the query contains a lot of words (questions), the default AND is not appropriate, because a single missing stopword could hide a good result. We could use the minimum_should_match attribute, which lets us require a minimum number of terms to match (e.g. 90% of the query terms should match).
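To make the minimum_should_match idea concrete, here is a minimal sketch of such a query body built as a Python dict. The field name "text" and both helper names are hypothetical and only for illustration, not what CirrusSearch actually uses. The rounding-down of percentages follows the Elasticsearch documentation for minimum_should_match.

```python
import json


def build_match_query(user_query, field="text", min_match="90%"):
    """Build a match query body that requires only part of the terms.

    `field` is a hypothetical field name; `min_match` is passed through
    unchanged as Elasticsearch's minimum_should_match parameter.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": user_query,
                    "minimum_should_match": min_match,
                }
            }
        }
    }


def required_terms(term_count, percent):
    """Number of terms that must match for a positive percentage.

    Elasticsearch rounds the computed number down, so integer
    division is enough here.
    """
    return term_count * percent // 100


if __name__ == "__main__":
    body = build_match_query(
        "what's the connection between power laws and zipf distribution"
    )
    print(json.dumps(body, indent=2))
    # How many of 9 query terms would have to match at 90%.
    print(required_terms(9, 90))
```

With a nine-term question, 90% rounds down to eight required terms, so one missing stopword no longer hides the result.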
There's also another interesting query which does the "stopword stripping" automagically: the common terms query [1]. In short, this query detects stopwords by analyzing word frequency at query time, so the query:
What's the connection between power laws and zipf distribution
will be split into 2 clauses:
- connection power laws zipf distribution
- what's the between and
And we can control the boolean operator of these clauses independently, e.g. OR for high-frequency words and AND for low-frequency words. Or even more complex stuff like "3<80%" [2]: if there are more than 3 words, only 80% of them are required.
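As a rough sketch of what that could look like, the snippet below builds a common terms query body with AND for the low-frequency clause and OR for the high-frequency one. The parameter names (cutoff_frequency, low_freq_operator, high_freq_operator) come from the common terms query documentation [1]. The field name "text", the cutoff value 0.001 and the helper names are assumptions for illustration only. The second helper just spells out how a single conditional spec like "3<80%" is read according to [2]; it is not how Elasticsearch implements it.

```python
import json
import re


def build_common_terms_query(user_query, field="text", cutoff=0.001):
    """Build a common terms query body (field and cutoff are assumed)."""
    return {
        "query": {
            "common": {
                field: {
                    "query": user_query,
                    # Terms above this document frequency are treated
                    # as "high frequency" (i.e. stopword-like).
                    "cutoff_frequency": cutoff,
                    "low_freq_operator": "and",
                    "high_freq_operator": "or",
                }
            }
        }
    }


def required_by_conditional_spec(clause_count, spec):
    """Interpret one conditional spec of the form "N<P%".

    At or below N clauses, all of them are required; above N, only
    P percent (rounded down) are required.
    """
    match = re.fullmatch(r"(\d+)<(\d+)%", spec)
    if match is None:
        raise ValueError("unsupported spec: %r" % spec)
    threshold = int(match.group(1))
    percent = int(match.group(2))
    if clause_count <= threshold:
        return clause_count
    return clause_count * percent // 100


if __name__ == "__main__":
    body = build_common_terms_query(
        "what's the connection between power laws and zipf distribution"
    )
    print(json.dumps(body, indent=2))
    for n in (3, 4, 5):
        print(n, required_by_conditional_spec(n, "3<80%"))
```

The conditional spec string itself would go into minimum_should_match; the helper is only there to show which counts it produces.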
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-co...
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mi...
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search