So, the technical term (in English) for these filler words is "stop
words",[1] and stripping them is common practice (especially back in the bad
old days when we had less powerful computers, though it made searching for
"to be or not to be" really hard). Stripping them when a query fails is
a reasonable fallback plan, as Kevin suggests. ("Between" is usually on
the list, too, so searching /connection power laws zipf distribution/ gives
fine results, and I'd certainly include "what's" and other contractions on
the list.)
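A minimal sketch of that fallback, in Python. The stop word set is a tiny illustrative sample, and `search` is a hypothetical callable standing in for whatever actually runs the query; neither reflects the real search code.

```python
import re

# Illustrative sample only; a real stop word list would be much longer.
STOP_WORDS = {"what's", "what", "is", "the", "and", "between", "a", "an", "of", "to"}

def tokenize(query):
    """Lowercase the query and split it into words, keeping apostrophes."""
    return re.findall(r"[\w']+", query.lower())

def strip_stop_words(query):
    """Return the query with stop words removed, preserving word order."""
    return " ".join(t for t in tokenize(query) if t not in STOP_WORDS)

def search_with_fallback(query, search):
    """Run `search` (hypothetical); on zero results, retry without stop words."""
    results = search(query)
    if results:
        return results
    tokens = tokenize(query)
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Retry only if stripping removed something and left something to search for.
    if kept and len(kept) < len(tokens):
        return search(" ".join(kept))
    return results
```

Because the stripped query is only tried after the original fails, a query like "to be or not to be" still runs unmodified first.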
The wiki link at [1] has links to several lists, including one with 29
languages,[2] though the link there is broken (I found it on
archive.org).[3] The Spanish and French lists, at least, are a little light:
part of the problem is all the forms of a given verb, which they don't seem
to include, just the most common ones. (And I'd suggest adding variants
without diacritics.)
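One way to generate the diacritic-free variants, sketched in Python with the standard `unicodedata` module. The function names are made up for illustration, and a native speaker would still want to check the output for collisions with real content words.

```python
import unicodedata

def strip_diacritics(word):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def with_plain_variants(stop_words):
    """Return the stop words plus a diacritic-free copy of each one."""
    return set(stop_words) | {strip_diacritics(w) for w in stop_words}
```

Words that have no diacritics map to themselves, so the set only grows where a variant actually differs.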
Alternatively, a native speaker could take a frequency list of terms from
search queries (or maybe just zero-result queries) and make a custom
list of stop words (which may account for question words showing up more,
and other ways that queries differ from random text). It takes a couple of
hours at most, given the list. (I've recently done this for a personal
project.)
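Building the frequency list itself is the easy part. A rough Python sketch, assuming the queries are already available as a list of strings (the function names and the `top_n` cutoff are arbitrary choices for illustration):

```python
import re
from collections import Counter

def term_frequencies(queries):
    """Count how often each lowercased term appears across all queries."""
    counts = Counter()
    for query in queries:
        counts.update(re.findall(r"[\w']+", query.lower()))
    return counts

def candidate_stop_words(queries, top_n=200):
    """Return the top_n most frequent terms as (term, count) pairs for review."""
    return term_frequencies(queries).most_common(top_n)
```

The output is only a list of candidates: frequent content words will show up alongside real stop words, which is why the manual pass by a native speaker is still needed.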
Anyway, I don't know if doing this in English will help a whole lot for
full text search. The recent analysis I did for Dan on full text zero rates
indicates that enwiki is not the problem.[4] enwiki had ~14% zero results
over a one-week period in both July and August. Given the level of crap we
see in our searches, I can't imagine that going below 10% (for full text),
which would only lower the overall rate by ~2%.
Let's ignore itwiki* for the moment; my analysis doesn't take into account
the interwiki search there. Are we 100% sure the dashboards do? I believe
they do, I just don't want it to be true. :(
It looks like we're going to have to pull down numbers for lots of
individual non-English wikis, though we may get lucky if we look into
individual ones and find big stupid activities (like nlwiktionary's .de
domain name searches accounting for their 99% zero results rate).
Anyway, I like stripping stop words better than relaxing AND to OR, unless
there's some additional post-search ranking to sort the results into a more
AND-ish order.
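To make "AND-ish order" concrete, here is a hypothetical Python sketch of one such post-search re-ranking: sort the OR results by how many distinct query terms each document contains, then by its original score. The `(text, score)` hit format is an assumption for illustration, not how Elasticsearch or our search code returns results.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[\w']+", text.lower())

def and_ish_rerank(query, hits):
    """Re-rank OR-query hits so documents matching more query terms come first.

    `hits` is a list of (document_text, score) pairs (hypothetical format).
    Ties on the number of matched terms fall back to the original score.
    """
    terms = set(tokenize(query))

    def sort_key(hit):
        text, score = hit
        matched = len(terms & set(tokenize(text)))
        return (matched, score)

    return sorted(hits, key=sort_key, reverse=True)
```

A document containing every query term ends up at the top regardless of its raw score, which is the behavior the default AND would have guaranteed.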
—Trey
[1] https://en.wikipedia.org/wiki/Stop_words
[2] https://code.google.com/p/stop-words/
[3] https://web.archive.org/web/*/http://tonyb.sk/_my/ir/stop-words-collection-…
[4] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul…
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Thu, Aug 27, 2015 at 9:21 AM, David Causse <dcausse(a)wikimedia.org> wrote:
On 27/08/2015 17:59, Kevin Smith wrote:
On Thu, Aug 27, 2015 at 4:30 AM, David Causse <dcausse(a)wikimedia.org>
wrote:
There's another feature we could work on after this one:
Review the default AND operator between words. This seems to be in line
with Moiz's survey results and "somewhat" related to the paper reviewed by
Trey:
Users ask questions, not keywords. For example, this query:
what's the connection between power laws and zipf law [1]
returns no result
but:
power laws zipf distribution [2]
returns good results
Earlier, I suggested ignoring "filler" words, but we thought elastic was
already doing scoring adjustments that would have a similar effect.
Apparently not, because a search for:
connection between power laws zipf distribution
brings up what look like pretty reasonable results. Throwing away
"what's", "the", and "and" before running the search would help a lot (at
least in this case).
Yes, the term that prevents the result from being found is "what".
Elasticsearch will limit the effect of such words in the score, but the
default AND will force all these words to be in the document.
We also have some trouble with "what's" vs "what is"... I'll have a look.