We have been working on a replacement autocompletion API that is more forgiving than a strict prefix search. The scoring algorithms have a long way to go, but we have completed the first run-through of building the completion index for enwiki, so I thought I would share:
Here are a couple of examples; feel free to change the text= value to other things.
http://cirrus-browser-bot.wmflabs.org/w/api.php?action=cirrus-suggest&limit=100&text=white%20house
http://cirrus-browser-bot.wmflabs.org/w/api.php?action=cirrus-suggest&limit=100&text=hotel%20cal
http://cirrus-browser-bot.wmflabs.org/w/api.php?action=cirrus-suggest&limit=100&text=wikim
Don't stress out too much over the scoring yet; we know it needs some work and have plans to integrate page view information here to help more popular articles rise to the top.
Erik B
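For anyone who wants to script requests instead of editing the links by hand, here is a minimal sketch that only builds the request URL. The host and the `action` value are copied from the example links; the `limit` parameter is assumed from the full form of those links, and the helper name `suggest_url` is made up for illustration. The response format is not described in this thread, so the sketch does not try to parse one.

```python
from urllib.parse import urlencode, quote

# Base URL of the demo API, copied from the example links in this thread.
BASE_URL = "http://cirrus-browser-bot.wmflabs.org/w/api.php"


def suggest_url(text, limit=100):
    """Build a cirrus-suggest request URL for the given search text.

    `suggest_url` is a hypothetical helper, not part of any Wikimedia library.
    """
    params = {"action": "cirrus-suggest", "limit": limit, "text": text}
    # quote_via=quote encodes spaces as %20 (urlencode's default would use '+').
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(suggest_url("hotel cal"))
```

Changing the argument to `suggest_url` is the scripted equivalent of changing text= in the links.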
Thanks Erik!
I uploaded a small HTML page to compare both approaches: http://cirrus-browser-bot.wmflabs.org/suggest.html
On 26/08/2015 00:38, Erik Bernhardson wrote:
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
Thanks David! That page is awesome for anecdotal testing.
I notice that a search for IOL (or iol) would also benefit from prioritizing an exact match. The prefix autocomplete brings up IOL near the bottom, but the suggester misses it entirely. Same for XP, CPU, etc.
Kevin Smith Agile Coach, Wikimedia Foundation
On Wed, Aug 26, 2015 at 6:02 AM, David Causse dcausse@wikimedia.org wrote:
Thanks Kevin!
Yes, you're right, it's a missing feature: prefix search runs a specific query to find the exact match. I'll try to add it.
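To illustrate what "prioritizing an exact match" could look like, here is a rough sketch that reorders a list of suggestions after the fact. This is an assumption for illustration only: it is not the specific query that prefix search runs, and the function name and the sample titles (other than IOL) are made up.

```python
def promote_exact_match(query, suggestions):
    """Return suggestions with case-insensitive exact title matches first.

    Hypothetical post-processing step; the real fix may live in the query itself.
    """
    wanted = query.casefold()
    # sorted() is stable, so the original order is kept within each group:
    # exact matches (key False) come before everything else (key True).
    return sorted(suggestions, key=lambda title: title.casefold() != wanted)


# Sample titles are invented for the example.
titles = ["Iolanthe", "Iolaus", "IOL", "Iola, Kansas"]
print(promote_exact_match("iol", titles))
```

A stable sort is used so that whatever scoring produced the original order is preserved for the non-exact results.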
On 26/08/2015 19:48, Kevin Smith wrote:
Hi!
I uploaded a small HTML page to compare both approaches: http://cirrus-browser-bot.wmflabs.org/suggest.html
This is very cool! From my very short testing, seems that it works pretty nicely.
I ran some zero result rate tests against this API today. It is a huge reduction in the zero result rate over the existing prefix search: from 32% to 19% (on a 1% sample of prefix searches for an entire day).
On Wed, Aug 26, 2015 at 12:34 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
And that's in line with the previous experiment. If you have a 32% zero results rate, reducing it by 38% (32% * (1-.38)) gives 19.84%. So, allow a little rounding error in the "32", "38" and "19", and this is right on the money.
—Trey P.S.: 2 + 2 = 5, for very large values of 2.
Trey Jones Software Engineer, Discovery Wikimedia Foundation
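The arithmetic above can be checked with a few lines. This is just a worked example of the numbers quoted in this thread (32%, 38%, 19%); the function names are made up.

```python
def apply_reduction(before, reduction):
    """Rate that results from lowering `before` by the fraction `reduction`."""
    return before * (1 - reduction)


def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


# 32% zero results rate reduced by 38% (figure from the previous experiment).
print(round(apply_reduction(32.0, 0.38), 2))

# Relative reduction implied by the measured drop from 32% to 19%.
print(round(relative_reduction(32.0, 19.0), 3))
```

The second figure comes out slightly above 38%, which is the "little rounding error" allowance mentioned above.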
On Wed, Aug 26, 2015 at 3:58 PM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
So I'm hearing we may have a contender for 'big changes to the ZRR' then ;).
This seems to reinforce the 'big features, not small config changes' approach to the problem.
On 26 August 2015 at 19:34, Trey Jones tjones@wikimedia.org wrote:
Yes, you're right. Reading and re-reading the Cirrus config file, I can't find anything that could bring more results by just tweaking some config values :(
The next step is to use interwiki searches to run queries written in another language, which is also a "big feature".
There's another feature we could work on after this one: reviewing the default AND operator between words. This seems to be in line with Moiz's survey results and "somewhat" related to the paper reviewed by Trey: users ask questions, not keywords. For example, this query:
what's the connection between power laws and zipf law [1]
returns no result, but:
power laws zipf distribution [2]
returns good results.
I think a first naive approach would be to review this default AND and try something like: if there are more than X words, allow Y% of them to match.
[1] https://en.wikipedia.org/w/index.php?title=Special:Search&search=what%27... [2] https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=defa...
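The naive "more than X words, allow Y% to match" rule can be sketched in plain Python. Everything here is an assumption for illustration: the default values X=3 and Y=80%, the function names, whitespace tokenization, and rounding the percentage down. It is not how CirrusSearch matches documents.

```python
import math


def required_matches(num_terms, max_strict_terms=3, fraction=0.8):
    """How many query terms must match under the relaxed-AND rule.

    max_strict_terms plays the role of X and fraction the role of Y;
    both defaults are placeholders, not tuned values.
    """
    if num_terms <= max_strict_terms:
        return num_terms  # short query: behave like a plain AND
    return math.floor(num_terms * fraction)  # long query: only a share is required


def matches(query, document, max_strict_terms=3, fraction=0.8):
    """True if enough whitespace-separated query terms occur in the document."""
    terms = query.lower().split()
    doc_words = set(document.lower().split())
    hits = sum(1 for term in terms if term in doc_words)
    return hits >= required_matches(len(terms), max_strict_terms, fraction)


print(required_matches(3))   # short query, all terms required
print(required_matches(10))  # long query, only a share required
```

With these placeholder values a three-word query still needs all three words, while a ten-word question tolerates two missing words.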
On 27/08/2015 04:39, Oliver Keyes wrote:
That sounds like a great set of ideas. Are we capturing these in phabricator tickets?
(Another approach with 'and' statements would be something like:
If it has a question mark: consider ANDs to be strings rather than operators.
Else: use AND as an operator. If that produces zero results: round-trip with AND as a string.
)
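The control flow of that idea can be sketched as follows. The `search` argument is a hypothetical callable standing in for whatever actually runs the query; its second parameter says whether AND should be treated as an operator. Nothing here is an existing CirrusSearch API.

```python
def search_with_and_fallback(query, search):
    """Run a query, deciding whether AND is an operator or a plain word.

    `search(query, and_is_operator)` is a hypothetical callable that
    returns a list of results.
    """
    if "?" in query:
        # Looks like a natural-language question: treat AND as a plain string.
        return search(query, False)
    results = search(query, True)
    if not results:
        # Zero results with AND as an operator: retry with AND as a string.
        results = search(query, False)
    return results
```

The second round trip only happens on zero results, so queries that already work keep their current behaviour.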
On 27 August 2015 at 07:30, David Causse dcausse@wikimedia.org wrote:
On Thu, Aug 27, 2015 at 4:30 AM, David Causse dcausse@wikimedia.org wrote:
There's another feature we could work on after this one: Review the default AND operator between words. This seems to be in line with Moiz's survey results and "somewhat" related to the paper reviewed by Trey : Users ask questions not keywords, for example this query : what's the connection between power laws and zipf law [1] returns no result
but: power laws zipf distribution [2] returns good results
Earlier, I suggested ignoring "filler" words, but we thought elastic was already doing scoring adjustments that would have a similar effect. Apparently not, because a search for:
connection between power laws zipf distribution
brings up what look like pretty reasonable results. Throwing away "what's", "the", and "and" before running the search would help a lot (at least in this case).
Kevin
On 27/08/2015 17:59, Kevin Smith wrote:
Earlier, I suggested ignoring "filler" words, but we thought elastic was already doing scoring adjustments that would have a similar effect. Apparently not, because a search for:
connection between power laws zipf distribution
brings up what look like pretty reasonable results. Throwing away "what's", "the", and "and" before running the search would help a lot (at least in this case).
Yes, the term that prevents the result from being found is "what". Elasticsearch will limit the effect of such words on the score, but the default AND will force all these words to be in the document.
We also have some trouble with "what's" vs "what is"... I'll have a look.
So, the technical term (in English) for these filler words is "stop words",[1] and stripping them is common practice (esp. back in the bad old days when we had less powerful computers—though it made searching for "to be or not to be" really really hard). Stripping them when a query fails is a reasonable fallback plan, as Kevin suggests. (And "between" is usually on the list, too, so searching /connection power laws zipf distribution/ gives fine results, and I'd certainly include "what's" and other contractions on the list.)
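A rough sketch of the "strip stop words when a query fails" fallback, under stated assumptions: the stop word set below is a tiny illustrative list built from the words mentioned in this thread, `search` is a hypothetical callable that returns a list of results, and tokenization is plain whitespace splitting.

```python
# Tiny illustrative list; a real one would be much longer and per-language.
STOP_WORDS = {"what's", "the", "and", "between"}


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Remove stop words, but never return an empty query."""
    kept = [word for word in query.split() if word.lower() not in stop_words]
    # If everything was a stop word ("to be or not to be"), keep the original.
    return " ".join(kept) if kept else query


def search_with_fallback(query, search):
    """Try the query as typed; on zero results, retry without stop words."""
    results = search(query)
    if results:
        return results
    stripped = strip_stop_words(query)
    return search(stripped) if stripped != query else results


print(strip_stop_words("what's the connection between power laws and zipf distribution"))
```

Keeping the original query when nothing would be left is one simple way to avoid the "to be or not to be" problem; other choices are possible.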
The wiki link at [1] has links to several lists, including one with 29 languages [2], though the link there is broken (but I found it on archive.org).[3] The Spanish and French lists, at least, are a little light (part of the problem is all the forms of a given verb, which they don't seem to include, just the most common ones). (And I'd suggest adding variants without diacritics.)
Alternatively, a native speaker could take a frequency list of terms from search queries (or maybe just zero-result search queries) and make a custom list of stop words (which may account for question words showing up more, and other ways that queries differ from random text). It takes a couple of hours at most, given the list. (I've recently done this for a personal project.)
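Producing the frequency list itself is the easy part. A minimal sketch, assuming the queries are available as plain strings and that whitespace splitting is good enough for a first pass (the function name and sample queries are made up):

```python
from collections import Counter


def term_frequencies(queries, top_n=50):
    """Return the top_n most frequent terms across the given query strings."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts.most_common(top_n)


# Invented sample queries, only to show the shape of the output.
sample = ["what is the zipf law", "the power laws", "what is the time"]
for term, count in term_frequencies(sample, top_n=3):
    print(term, count)
```

The output is only a candidate list; deciding which frequent terms really are stop words is still the manual step that needs a native speaker.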
Anyway, I don't know if doing this in English will help a whole lot for full text search. The recent analysis I did for Dan on full text zero rates indicates that enwiki is not the problem.[4] enwiki had ~14% zero results over a one-week period in both July and August. Given the level of crap we see in our searches, I can't imagine that going below 10% (for full text), which would only lower the overall rate by ~2%.
Let's ignore itwiki* for the moment; my analysis doesn't take into account the interwiki search there—are we 100% sure dashboards do? I believe it does, I just don't want it to be true. :(
It looks like we're going to have to pull down numbers for lots of individual non-English wikis, though we may get lucky if we look into individual ones and find big stupid activities (like nlwiktionary's .de domain name searches accounting for their 99% zero results rate).
Anyway, I like stripping stop words better than relaxing AND to OR, unless there's some additional post-search ranking to sort the results into a more AND-ish order.
—Trey
[1] https://en.wikipedia.org/wiki/Stop_words [2] https://code.google.com/p/stop-words/ [3] https://web.archive.org/web/*/http://tonyb.sk/_my/ir/stop-words-collection-2... [4] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Result...
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Thu, Aug 27, 2015 at 9:21 AM, David Causse dcausse@wikimedia.org wrote:
Can I say how hilarious it is that we're discussing stop words in the context of Zipf Distributions? ;)
On 27 August 2015 at 16:29, Trey Jones tjones@wikimedia.org wrote:
On 27/08/2015 22:29, Trey Jones wrote:
Anyway, I like stripping stop words better than relaxing AND to OR, unless there's some additional post-search ranking to sort the results into a more AND-ish order.
I think my previous mail was misleading: I don't want to replace AND with OR. I mean that when the query contains a lot of words (questions), the default AND is not appropriate because a single missing stopword could hide a good result. We could use the minimum_should_match attribute, which allows us to require a minimum number of terms to match (e.g. 90% of the query terms should match).
There's also another interesting query that does the "stopword stripping" automagically: the common terms query [1]. In short, this query is able to detect stopwords by analyzing word frequency at query time, so the query:
What's the connection between power laws and zipf distribution
will be split into 2 clauses:
- connection power laws zipf distribution
- what's the between and
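The split into two clauses can be imitated in plain Python to show the idea. This is a simplification and not how Elasticsearch implements it: the document frequencies below are invented numbers, the cutoff is a placeholder, and the function name is made up.

```python
def split_by_frequency(query, doc_freq, cutoff=0.1):
    """Split query terms into (low_frequency, high_frequency) lists.

    doc_freq maps a term to the fraction of documents that contain it;
    terms missing from the map are treated as rare.
    """
    low, high = [], []
    for term in query.lower().split():
        if doc_freq.get(term, 0.0) > cutoff:
            high.append(term)  # behaves like a stop word
        else:
            low.append(term)   # carries the meaning of the query
    return low, high


# Hypothetical frequencies, chosen only to reproduce the example split.
DOC_FREQ = {"what's": 0.3, "the": 0.9, "between": 0.4, "and": 0.8}

low, high = split_by_frequency(
    "What's the connection between power laws and zipf distribution", DOC_FREQ
)
print(" ".join(low))
print(" ".join(high))
```

No stop word list is needed here: a term lands in the high-frequency clause purely because of how common it is in the index.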
And we can control the boolean operator of these clauses independently, e.g. OR for high-frequency words and AND for low-frequency words. Or even more complex stuff like "3<80%" [2]: if there are more than 3 words, only 80% of them are required.
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-co... [2] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mi...
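For reference, here is roughly what the two request bodies could look like, written as Python dicts and printed as JSON. The field name `text` and the `cutoff_frequency` value are assumptions for illustration, not values from the Cirrus config. The parameter names follow the Elasticsearch documentation linked above; newer Elasticsearch releases have deprecated the common terms query, so check the documentation for the version in use.

```python
import json

QUESTION = "what's the connection between power laws and zipf distribution"

# match query: with more than 3 terms, only 80% of them have to match.
relaxed_match = {
    "query": {
        "match": {
            "text": {  # hypothetical field name
                "query": QUESTION,
                "minimum_should_match": "3<80%",
            }
        }
    }
}

# common terms query: AND for low-frequency terms, OR for high-frequency ones.
common_terms = {
    "query": {
        "common": {
            "text": {  # hypothetical field name
                "query": QUESTION,
                "cutoff_frequency": 0.001,  # placeholder cutoff
                "low_freq_operator": "and",
                "high_freq_operator": "or",
            }
        }
    }
}

print(json.dumps(relaxed_match, indent=2))
print(json.dumps(common_terms, indent=2))
```

Both bodies leave scoring untouched; they only change which documents are allowed to match.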
Yeah, it looks like Common Terms is a low-effort, high-value way of dealing with this issue. Of course ES is going to have some clever way of dealing with stop words.
Here's a more readable blog post about Common Terms: https://www.elastic.co/blog/stop-stopping-stop-words-a-look-at-common-terms-...
And, for reference, ES has stop word lists for >30 languages: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-sto...
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Fri, Aug 28, 2015 at 1:34 AM, David Causse dcausse@wikimedia.org wrote:
Our initial tests of this suggestions API are incredibly promising at reducing the zero results rate: https://phabricator.wikimedia.org/T109729
More rigorous testing must be done before we can consider replacing prefixsearch with the suggestion API.
Thanks!
Dan
On 25 August 2015 at 15:38, Erik Bernhardson ebernhardson@wikimedia.org wrote:
We had to shuffle a few things around in labs; the demo is now running at:
http://suggesty.wmflabs.org/suggest.html
On Thu, Aug 27, 2015 at 11:00 AM, Dan Garry dgarry@wikimedia.org wrote:
-- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation