Hmm. I did a quick test searching for some DOIs, and Lagotto's syntax does in fact work fine. But most articles in the world are simply not referenced in Wikipedia. I searched for "DOI" and found an example DOI: 10.1016/j.fgb.2007.07.013. All of these searches give the same 2 results:

10.1016/j.fgb.2007.07.013
"10.1016/j.fgb.2007.07.013"
"10.1016/j.fgb.2007.07.013" OR "http://www.sciencedirect.com/science/article/pii/S1087184507001259"

insource:10.1016/j.fgb.2007.07.013 gives a third result, but it's actually not relevant (it's a partial match on "10.1016/j.fgb"). So maybe Nemo should withdraw the suggestion to Lagotto entirely?

What we have here may actually just be 50,000 searches (per hour) for things that do not exist in Wikipedia, and zero results is the correct answer.

It sounds more and more like "zero results queries from known automata" is a good category for the dashboard.

By the way, while I like machine learning as much as the next math nerd, it's not the only relevant approach. I found these guys by hand very quickly, and we can definitely pick off low-hanging fruit like this manually. (The quot queries are another example.) I also think some minimal analysis by an expert system could identify other instances of clear categories of non-failure zero-results, like prefix searches: the series ant ... antm ... antma ... antman is clearly going somewhere, even though antma has no results.
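
To make the prefix-search case concrete, here's the kind of rule I have in mind (purely illustrative, and it assumes we can group one session's queries in time order):

    def is_prefix_session(queries):
        """True for a run like ant -> antm -> antma -> antman, where each
        query strictly extends the previous one: almost certainly someone
        typing, so its zero-result entries aren't real search failures."""
        return (len(queries) >= 3 and
                all(b.startswith(a) and len(b) > len(a)
                    for a, b in zip(queries, queries[1:])))

    assert is_prefix_session(["ant", "antm", "antma", "antman"])
    assert not is_prefix_session(["ant", "bee", "cat"])

Anything flagged this way could be excluded from the failure counts before we reach for fancier models.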

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation


On Tue, Jul 28, 2015 at 11:10 AM, David Causse <dcausse@wikimedia.org> wrote:
On 28/07/2015 16:32, Trey Jones wrote:
Nemo recommended insource: to Lagotto because it would actually work and do what they want, but didn't consider the computational cost on our end. However, if we only allow 20 such queries at a time, they would probably monopolize that capacity entirely. In my sample we got about 50,000 of these queries in about an hour.

David/Chad, can you look at Nemo's issue and comment there on what's plausible and what's not?
https://github.com/lagotto/lagotto/issues/405

I added a comment there.


Also, is this the kind of use case that we want to support? I'm not suggesting that it isn't, I really don't know. But they aren't looking for information, they are looking for something akin to impact factor on reputable parts of the web. If that's not something we want to support, how do we let them know? If that doesn't help—e.g., because it's some other installation using their tool that's generating all the queries—do we block it?

I don't know what to do with this. They use our search engine as a workaround: I guess they don't want to deal with too much data, and it's pretty convenient to send queries to a system that does not blacklist anyone. If they were using Google, they would only have been able to run something like 1 query per minute.

We should block/limit a source if:
- It hurts the system and makes the search experience bad for others
- It pollutes our stats in a way that makes it impossible for us to learn anything from search logs
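
If we do end up limiting a source, a simple per-source token bucket is probably enough. A purely hypothetical sketch (this is not what Cirrus does today, and the rate/burst numbers are made up):

    import time
    from collections import defaultdict

    class TokenBucket:
        """Allow `rate` queries/second per source, with bursts up to
        `burst`. The source key could be an IP, a user agent, whatever
        identifies the automaton."""
        def __init__(self, rate=1.0, burst=20):
            self.rate, self.burst = rate, burst
            self.tokens = defaultdict(lambda: float(burst))
            self.last = defaultdict(time.monotonic)

        def allow(self, source):
            now = time.monotonic()
            # Refill for the time elapsed since this source's last
            # query, capped at the burst size.
            elapsed = now - self.last[source]
            self.last[source] = now
            self.tokens[source] = min(self.burst,
                                      self.tokens[source] + elapsed * self.rate)
            if self.tokens[source] >= 1:
                self.tokens[source] -= 1
                return True
            return False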

When we start doing some statistical machine learning, this is something we will have to address.

Concerning the costly operators: if other tools/sources start to use them in a way that affects system performance, I'm afraid we will have to make these expert features protected by permissions granted by wiki admins.



_______________________________________________
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search