Nemo recommended insource: to Lagotto because it would actually work and do what they want, but he didn't consider the computational cost on our end. Given that we only allow 20 insource queries to run at a time, they would probably monopolize that capacity entirely: in my sample, we received about 50,000 of these queries in roughly an hour.
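
For concreteness, this is roughly the shape of the lookup Nemo is suggesting, going through the standard MediaWiki search API (the DOI and the exact parameters are made up for illustration; I haven't checked exactly what Lagotto sends):

    import requests

    # Hypothetical example: an insource: lookup for a DOI-like identifier.
    # insource: matches the raw wikitext, so an exact identifier actually
    # turns up the articles that cite it, instead of returning zero results.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": 'insource:"10.1371/journal.pone.0123456"',
            "format": "json",
        },
    )
    print(resp.json()["query"]["searchinfo"]["totalhits"])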

David/Chad, can you look at Nemo's issue and comment there on what's plausible and what's not?
    https://github.com/lagotto/lagotto/issues/405

Also, is this the kind of use case that we want to support? I'm not suggesting that it isn't; I really don't know. But they aren't looking for information; they're looking for something akin to an impact factor for reputable parts of the web. If that's not something we want to support, how do we let them know? And if that doesn't help (e.g., because it's some other installation running their tool that's generating all the queries), do we block it?

At the very least, we should ignore these malformed queries in our own metrics.

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation


On Tue, Jul 28, 2015 at 10:18 AM, David Causse <dcausse@wikimedia.org> wrote:
On 28/07/2015 at 15:09, Trey Jones wrote:
My first pass at data gathering at the end of the day yesterday was slightly skewed, but the general trend still holds: getting Nemo's suggested "insource:" queries into Lagotto will definitely cut down on the number of zero-result searches we get (and actually do what they intend), if the update gets pushed out to their heavy users.

Beware that insource: is the most expensive query type, and we allow only 20 insource queries to run concurrently (across all Wikipedia sites). I'm not sure it's a good idea to expose this feature too widely.
There are several features like that (that is, syntax not available in any other search engine with a large audience):
- wildcard queries (*)
- insource
- fuzzy searches

While these features are very useful to "expert users", I think we should not rely on such syntax to decrease the zero-result rate, because it won't scale.
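
To make it concrete, these are the kinds of query strings I mean (the terms themselves are invented, just to show the syntax):

    # CirrusSearch "expert" syntax, illustrated with made-up terms:
    expert_queries = [
        "hydro*",                        # wildcard: expands against the whole term dictionary
        'insource:"exact wikitext phrase"',  # insource: matches the raw wikitext
        "insource:/citation needed/",    # insource with a regex, the most expensive form
        "accomodation~1",                # fuzzy search: edit-distance matching
    ]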

Another solution for this specific use case would be to build a custom analyzer that extracts this information from the content and exposes it through a dedicated, scalable search field.
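
A very rough sketch of the idea, against a recent Elasticsearch (the index name, field names, and the DOI pattern are placeholders, not anything that exists in CirrusSearch today): a pattern tokenizer pulls identifier-like tokens out of the article text into a sub-field that can be queried cheaply, instead of scanning the source at query time.

    import requests

    # Sketch only: a custom analyzer that keeps nothing but DOI-like tokens,
    # indexed as a sub-field of the article text.
    index_config = {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "doi_tokenizer": {
                        "type": "pattern",
                        # group 0 emits each regex match as a token,
                        # rather than splitting on the pattern
                        "pattern": "10\\.\\d{4,9}/\\S+",
                        "group": 0,
                    }
                },
                "analyzer": {
                    "doi_extractor": {
                        "type": "custom",
                        "tokenizer": "doi_tokenizer",
                        "filter": ["lowercase"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "text": {
                    "type": "text",
                    "fields": {
                        # text.dois contains only the extracted identifiers,
                        # so lookups against it stay cheap
                        "dois": {"type": "text", "analyzer": "doi_extractor"}
                    },
                }
            }
        },
    }

    requests.put("http://localhost:9200/doi_sketch", json=index_config)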



_______________________________________________
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search