So, the technical term (in English) for these filler words is "stop words",[1] and stripping them is common practice (esp. back in the bad old days when we had less powerful computers—though it made searching for "to be or not to be" really really hard). Stripping them when a query fails is a reasonable fallback plan, as Kevin suggests. (And "between" is usually on the list, too, so searching /connection power laws zipf distribution/ gives fine results, and I'd certainly include "what's" and other contractions on the list.)
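
As a rough illustration of that fallback (a minimal sketch, not how CirrusSearch or Elasticsearch actually does it): the stop word list below is a tiny made-up sample, and `search` is a hypothetical callable standing in for whatever runs the real query.

```python
# Tiny illustrative stop word list (an assumption); a real one would come
# from a curated per-language list.
STOP_WORDS = {"a", "an", "and", "between", "is", "of", "the", "what", "what's"}


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Return the query with stop words removed (case-insensitive match)."""
    kept = [word for word in query.split() if word.lower() not in stop_words]
    return " ".join(kept)


def search_with_fallback(query, search, stop_words=STOP_WORDS):
    """Run `search` (hypothetical); on zero results, retry once without stop words."""
    results = search(query)
    if results:
        return results
    stripped = strip_stop_words(query, stop_words)
    # Only retry if stripping changed the query and left at least one term.
    if stripped and stripped != query:
        return search(stripped)
    return results
```

The original query is always tried first, so a search like "to be or not to be" still gets a chance to match as typed before anything is thrown away.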

The wiki link at [1] has links to several lists, including one with 29 languages [2], though the link there is broken (but I found it on archive.org [3]). The Spanish and French, at least, are a little light (part of the problem is all the forms of a given verb, which they don't seem to include, just the most common ones). (And I'd suggest adding variants without diacritics.)
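
One way to generate the diacritic-free variants, sketched with Python's standard `unicodedata` module. The function names are made up for illustration, and this only handles accents that decompose into a base letter plus combining marks (letters like "ø" or "ß" would need their own mapping).

```python
import unicodedata


def strip_diacritics(word):
    """Drop combining marks after canonical decomposition (e.g. 'está' -> 'esta')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def with_plain_variants(stop_words):
    """Return the stop words plus a diacritic-free variant of each one."""
    return set(stop_words) | {strip_diacritics(word) for word in stop_words}
```

A stripped variant can collide with an unrelated content word, so the expanded list is probably worth a quick look by a native speaker.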

Alternatively, a native speaker could take a frequency list of terms from search queries (or maybe just zero-results queries) and make a custom list of stop words (which may account for question words showing up more, and other ways that queries differ from random text). It takes a couple of hours at most given the list. (I've recently done this for a personal project.)
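
Producing the frequency list itself is the easy part. A minimal sketch, assuming the queries are already available as a list of plain strings (the function name and the whitespace tokenization are illustrative assumptions, not anything from our logs tooling):

```python
from collections import Counter


def top_query_terms(queries, n=200):
    """Count lowercased, whitespace-separated terms and return the n most frequent."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts.most_common(n)
```

The output is only a candidate list. Deciding which of the frequent terms really are stop words is the part that needs the native speaker.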

Anyway, I don't know if doing this in English will help a whole lot for full text search. The recent analysis I did for Dan on full text zero rates indicates that enwiki is not the problem.[4] enwiki had ~14% zero results over a one-week period in both July and August. Given the level of crap we see in our searches, I can't imagine that going below 10% (for full text), which would only lower the overall rate by ~2%.

Let's ignore itwiki* for the moment; my analysis doesn't take into account the interwiki search there. Are we 100% sure the dashboards do? I believe they do, I just don't want it to be true. :(

It looks like we're going to have to pull down numbers for lots of individual non-English wikis, though we may get lucky if we look into individual ones and find big stupid activities (like nlwiktionary's .de domain name searches accounting for their 99% zero results rate).

Anyway, I like stripping stop words better than relaxing AND to OR, unless there's some additional post-search ranking to sort the results into a more AND-ish order.
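
To make "a more AND-ish order" concrete, here is a hedged sketch of one possible post-search re-ranking: sort the OR results by how many distinct query terms each document contains. The `(doc_id, text)` input shape and the whitespace matching are assumptions for illustration; a real version would presumably live in the search engine's own scoring.

```python
def rerank_and_ish(query, docs):
    """Re-rank OR results so documents matching more query terms come first.

    `docs` is a list of (doc_id, text) pairs in the engine's original order.
    """
    terms = set(query.lower().split())

    def matched_terms(doc):
        words = set(doc[1].lower().split())
        return len(terms & words)

    # sorted() is stable, so documents with equal counts keep their original order.
    return sorted(docs, key=matched_terms, reverse=True)
```

Documents that match every term end up on top, which is what AND would have returned, with partial matches following as a fallback.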

—Trey

[1] https://en.wikipedia.org/wiki/Stop_words
[2] https://code.google.com/p/stop-words/
[3] https://web.archive.org/web/*/http://tonyb.sk/_my/ir/stop-words-collection-2014-02-24.zip
[4] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#Change_in_Zero_Results_Rate_by_Wiki_.28July_to_August.29


Trey Jones
Software Engineer, Discovery
Wikimedia Foundation


On Thu, Aug 27, 2015 at 9:21 AM, David Causse <dcausse@wikimedia.org> wrote:
On 27/08/2015 17:59, Kevin Smith wrote:

On Thu, Aug 27, 2015 at 4:30 AM, David Causse <dcausse@wikimedia.org> wrote:
There's another feature we could work on after this one:
Review the default AND operator between words. This seems to be in line with Moiz's survey results and "somewhat" related to the paper reviewed by Trey:
Users ask questions, not keywords. For example, this query:
what's the connection between power laws and zipf law [1]
returns no result

but:
power laws zipf distribution [2]
returns good results


Earlier, I suggested ignoring "filler" words, but we thought elastic was already doing scoring adjustments that would have a similar effect. Apparently not, because a search for:

connection between power laws zipf distribution

brings up what look like pretty reasonable results. Throwing away "what's", "the", and "and" before running the search would help a lot (at least in this case).

Yes, the term that prevents the result from being found is "what".
Elasticsearch will limit the effect of such words in the score but the default AND will force all these words to be in the document.

We also have some trouble with "what's" vs "what is"... I'll have a look.



_______________________________________________
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search