Nemo recommended the insource: syntax to Lagotto because it would actually work and do
what they want, but didn't consider the computational cost on our end.
However, if we only allow 20 insource: queries at a time, they would probably
monopolize that capacity entirely. In my sample we got about 50,000 of these
queries in about an hour.
David/Chad, can you look at Nemo's issue and comment there on what's
plausible and what's not?
https://github.com/lagotto/lagotto/issues/405
Also, is this the kind of use case that we want to support? I'm not
suggesting that it isn't, I really don't know. But they aren't looking for
information, they are looking for something akin to impact factor on
reputable parts of the web. If that's not something we want to support, how
do we let them know? If that doesn't help—e.g., because it's some other
installation using their tool that's generating all the queries—do we block
it?
At the very least, we should ignore these malformed queries in our own
metrics.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Tue, Jul 28, 2015 at 10:18 AM, David Causse <dcausse(a)wikimedia.org>
wrote:
On 28/07/2015 at 15:09, Trey Jones wrote:
My first pass at data gathering at the end of the day yesterday was
slightly skewed, but the general trend still holds... getting Nemo's
suggested "insource:" queries into Lagotto will definitely cut down on the
number of zero-results searches we get (and actually do what they intend),
if the update gets pushed to their heavy users.
Beware that insource is the most expensive query type, and we allow only 20
insource queries to run concurrently (across all Wikipedia sites). I'm not
sure it's a good idea to expose this tool too widely.
There are several features like that (I mean syntax that's not available
in any other search engine with a large audience):
- wildcard queries (*)
- insource
- fuzzy searches
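For anyone who hasn't used them, these expert features look roughly like the
following. These are illustrative queries I've made up, not examples from the
thread, using the CirrusSearch syntax for wildcard, insource, and fuzzy
matching:

```
hast*                  wildcard: matches "haste", "hasty", "hastily", ...
insource:/citation/    insource: regex match against the raw wikitext
grammer~               fuzzy: tolerates small spelling differences
```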
While these features are very useful to "expert users", I think we should
not rely on such syntax to decrease the zero-result rate, because it won't
scale.
Another solution for this specific use case is to build a custom analyzer
that will extract this information from the content and expose a scalable
search field.
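As a sketch of what that could look like: CirrusSearch runs on Elasticsearch,
which lets you define a custom analyzer and apply it to a dedicated field at
index time. Assuming, purely for illustration, that the queries in question
are hunting for article identifiers such as DOIs in the wikitext, a
pattern_capture token filter could pull those identifiers into their own
indexed field, so that lookups become cheap term queries instead of expensive
insource: scans. The index name, field names, and regex here are all
hypothetical:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "doi_capture": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": ["(10\\.\\d{4,}/\\S+)"]
        }
      },
      "analyzer": {
        "doi_extractor": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "doi_capture"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "source_text": {
        "type": "text",
        "analyzer": "doi_extractor"
      }
    }
  }
}
```

With a field like this, a tool such as Lagotto could query it directly
instead of regex-scanning every document, which is what makes the approach
scalable.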
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search