Hmm. I did a quick test on searching for some DOIs, and in fact Lagotto's
syntax works fine. But most articles in the world are simply not
referenced in Wikipedia. I searched for "DOI" and found an example:
10.1016/j.fgb.2007.07.013. All of these searches give the same 2
results:
10.1016/j.fgb.2007.07.013
"10.1016/j.fgb.2007.07.013"
"10.1016/j.fgb.2007.07.013" OR "http://www.sciencedirect.com/science/article/pii/S1087184507001259"
insource:10.1016/j.fgb.2007.07.013 gives a third result, but it's actually
not relevant (it's a partial match on "10.1016/j.fgb"). So maybe Nemo
should withdraw the suggestion to Lagotto entirely?
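For anyone who wants to reproduce these searches, here is a minimal sketch against the public MediaWiki search API (assuming enwiki and the standard list=search endpoint; the hit counts reported above may have drifted since this was written):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def search_url(query, limit=5):
    """Build a fulltext (Cirrus) search request URL for the given query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def total_hits(query):
    """Fetch the reported total hit count for a query."""
    with urllib.request.urlopen(search_url(query)) as resp:
        data = json.load(resp)
    return data["query"]["searchinfo"]["totalhits"]

# Example (requires network access):
# for q in ('10.1016/j.fgb.2007.07.013',
#           '"10.1016/j.fgb.2007.07.013"',
#           'insource:10.1016/j.fgb.2007.07.013'):
#     print(q, "->", total_hits(q))
```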
What we have here may actually just be 50,000 searches (per hour) for
things that do not exist in Wikipedia, and zero results is the correct
answer.
It sounds more and more like "zero results queries from known automata" is
a good category for the dashboard.
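As a sketch of how that bucketing could work, a bare DOI lookup is easy to flag mechanically. (The regex below is my simplification of the DOI pattern — real DOI suffixes are nearly unrestricted — and a real classifier would presumably also key off user agents and source IPs.)

```python
import re

# Crossref-style DOI: "10." + numeric registrant + "/" + suffix,
# optionally quoted, optionally OR'd with a quoted URL as in the
# Lagotto-style queries above.
DOI_RE = re.compile(r'^"?(10\.\d{4,9}/\S+?)"?(\s+OR\s+"\S+")?$')

def looks_like_doi_lookup(query):
    """Heuristically flag queries that are bare DOI lookups,
    so they can be bucketed as 'zero results from known automata'
    rather than counted as real user search failures."""
    return bool(DOI_RE.match(query.strip()))
```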
By the way, while I like machine learning as much as the next math nerd,
that's not the only relevant approach. I found these guys by hand very
quickly, and we can definitely get low-hanging fruit like this manually.
(The quot queries are another example.) I also think some minimal analysis
by an expert system could identify other instances of clear categories of
non-failure zero-results (like prefix searches; the series ant ... antm ...
antma ... antman is clearly going somewhere, even though antma has no
results.)
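A sketch of that expert-system idea for the prefix case (the function name and the session grouping are hypothetical — this assumes queries have already been grouped by session and ordered by time): treat a zero-results query as a non-failure if the user's next query extends it.

```python
def prefix_chain_members(session_queries):
    """Return the queries that sit inside a typing chain, i.e. each
    query that is a proper prefix of the query that follows it.
    A zero-results query inside such a chain (like 'antma' on the
    way to 'antman') is plausibly not a real search failure."""
    members = set()
    for cur, nxt in zip(session_queries, session_queries[1:]):
        if nxt.startswith(cur) and cur != nxt:
            members.add(cur)
    return members
```

Anything this flags could be excluded from the zero-results rate before doing fancier analysis.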
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Tue, Jul 28, 2015 at 11:10 AM, David Causse <dcausse(a)wikimedia.org>
wrote:
On 28/07/2015 16:32, Trey Jones wrote:
Nemo recommended insource: to Lagotto because it
would actually work and
do what they want, but didn't consider the computational cost on our end.
However, if we only allow 20 at a time, they would probably monopolize it
entirely. In my sample we got about 50,000 of these queries in about an
hour.
David/Chad, can you look at Nemo's issue and comment there on what's
plausible and what's not?
https://github.com/lagotto/lagotto/issues/405
I added a comment there.
Also, is this the kind of use case that we want
to support? I'm not
suggesting that it isn't, I really don't know. But they aren't looking for
information, they are looking for something akin to impact factor on
reputable parts of the web. If that's not something we want to support, how
do we let them know? If that doesn't help—e.g., because it's some other
installation using their tool that's generating all the queries—do we block
it?
I don't know what to do with this. They use our search engine as a
workaround because, I guess, they don't want to deal with too much data, and
it's pretty convenient to send queries to a system that does not blacklist
anyone. If they were using Google, they would only have been able to run
something like 1 query per minute.
We should block/limit a source if:
- It hurts the system and makes the search experience bad for others
- It pollutes our stats in a way that makes it impossible for us to learn
anything from the search logs
When we start doing some statistical machine learning, this is something
we will have to address.
Concerning the costly operators: if other tools/sources start to use them
in a way that affects system performance, I'm afraid we will have to put
these expert features behind permissions granted by wiki admins.
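If we do end up limiting sources, the usual shape is a per-source token bucket (sketch only; the class name and parameters here are illustrative, not anything we run today):

```python
import time

class TokenBucket:
    """Per-source rate limiter: allow `rate` queries/sec on average,
    with bursts of up to `burst` queries."""

    def __init__(self, rate, burst):
        self.rate = rate          # refill rate, tokens per second
        self.burst = burst        # bucket capacity
        self.tokens = burst       # start full
        self.last = time.monotonic()

    def allow(self):
        """Spend one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One bucket per API consumer (keyed on user agent or IP) would cap a Lagotto-style burst without touching ordinary searchers.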
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search