Hmm. I did a quick test on searching for some DOIs, and in fact Lagotto's
syntax works fine. But most articles in the world are simply not
referenced in Wikipedia. I searched for "DOI" and found an example:
10.1016/j.fgb.2007.07.013. All of these searches give the same 2
results:
10.1016/j.fgb.2007.07.013
"10.1016/j.fgb.2007.07.013"
"10.1016/j.fgb.2007.07.013" OR "http://www.sciencedirect.com/science/article/pii/S1087184507001259"
insource:10.1016/j.fgb.2007.07.013 gives a third result, but it's actually
not relevant (it's a partial match on "10.1016/j.fgb"). So maybe Nemo
should withdraw the suggestion to Lagotto entirely?
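For anyone who wants to reproduce these searches, here is a minimal sketch against the public MediaWiki search API (assuming enwiki and the standard list=search endpoint; the hit counts reported above may have drifted since this was written):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def search_url(query, limit=5):
    """Build a fulltext (Cirrus) search request URL for the given query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def total_hits(query):
    """Fetch the reported total hit count for a query."""
    with urllib.request.urlopen(search_url(query)) as resp:
        data = json.load(resp)
    return data["query"]["searchinfo"]["totalhits"]

# Example (requires network access):
# for q in ('10.1016/j.fgb.2007.07.013',
#           '"10.1016/j.fgb.2007.07.013"',
#           'insource:10.1016/j.fgb.2007.07.013'):
#     print(q, "->", total_hits(q))
```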
What we have here may actually just be 50,000 searches (per hour) for
things that do not exist in Wikipedia, and zero results is the correct
answer.
It sounds more and more like "zero results queries from known automata" is
a good category for the dashboard.
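As a sketch of how that bucketing could work, a bare DOI lookup is easy to flag mechanically. (The regex below is my simplification of the DOI pattern — real DOI suffixes are nearly unrestricted — and a real classifier would presumably also key off user agents and source IPs.)

```python
import re

# Crossref-style DOI: "10." + numeric registrant + "/" + suffix,
# optionally quoted, optionally OR'd with a quoted URL as in the
# Lagotto-style queries above.
DOI_RE = re.compile(r'^"?(10\.\d{4,9}/\S+?)"?(\s+OR\s+"\S+")?$')

def looks_like_doi_lookup(query):
    """Heuristically flag queries that are bare DOI lookups,
    so they can be bucketed as 'zero results from known automata'
    rather than counted as real user search failures."""
    return bool(DOI_RE.match(query.strip()))
```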
By the way, while I like machine learning as much as the next math nerd,
that's not the only relevant approach. I found these guys by hand very
quickly, and we can definitely get low-hanging fruit like this manually.
(The quot queries are another example.) I also think some minimal analysis
by an expert system could identify other instances of clear categories of
non-failure zero-results (like prefix searches; the series ant ... antm ...
antma ... antman is clearly going somewhere, even though antma has no
results.)
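A sketch of that expert-system idea for the prefix case (the function name and the session grouping are hypothetical — this assumes queries have already been grouped by session and ordered by time): treat a zero-results query as a non-failure if the user's next query extends it.

```python
def prefix_chain_members(session_queries):
    """Return the queries that sit inside a typing chain, i.e. each
    query that is a proper prefix of the query that follows it.
    A zero-results query inside such a chain (like 'antma' on the
    way to 'antman') is plausibly not a real search failure."""
    members = set()
    for cur, nxt in zip(session_queries, session_queries[1:]):
        if nxt.startswith(cur) and cur != nxt:
            members.add(cur)
    return members
```

Anything this flags could be excluded from the zero-results rate before doing fancier analysis.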
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Tue, Jul 28, 2015 at 11:10 AM, David Causse <dcausse(a)wikimedia.org>
wrote:
On 28/07/2015 16:32, Trey Jones wrote:
Nemo recommended insource: to Lagotto because it
would actually work and
do what they want, but didn't consider the computational cost on our end.
However, if we only allow 20 at a time, they would probably monopolize it
entirely. In my sample we got about 50,000 of these queries in about an
hour.
David/Chad, can you look at Nemo's issue and comment there on what's
plausible and what's not?
https://github.com/lagotto/lagotto/issues/405
I added a comment there.
Also, is this the kind of use case that we want
to support? I'm not
suggesting that it isn't, I really don't know. But they aren't looking for
information, they are looking for something akin to impact factor on
reputable parts of the web. If that's not something we want to support, how
do we let them know? If that doesn't help—e.g., because it's some other
installation using their tool that's generating all the queries—do we block
it?
I don't know what to do with this. They use our search engine as a
workaround because, I guess, they don't want to deal with too much data, and
it's pretty convenient to send queries to a system that does not blacklist
anyone. If they were using Google, they would only have been able to run
something like 1 query per minute.
We should block/limit a source if:
- It hurts the system and makes the search experience bad for others
- It pollutes our stats in a way that makes it impossible for us to learn
anything from the search logs
When we start doing some statistical machine learning, this is something
we will have to address.
Concerning the costly operators: if other tools/sources start to use them
in a way that affects system performance, I'm afraid we will have to put
these expert features behind permissions granted by wiki admins.
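If we do end up limiting sources, the usual shape is a per-source token bucket (sketch only; the class name and parameters here are illustrative, not anything we run today):

```python
import time

class TokenBucket:
    """Per-source rate limiter: allow `rate` queries/sec on average,
    with bursts of up to `burst` queries."""

    def __init__(self, rate, burst):
        self.rate = rate          # refill rate, tokens per second
        self.burst = burst        # bucket capacity
        self.tokens = burst       # start full
        self.last = time.monotonic()

    def allow(self):
        """Spend one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One bucket per API consumer (keyed on user agent or IP) would cap a Lagotto-style burst without touching ordinary searchers.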
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search