Nemo recommended the insource: syntax to Lagotto because it would actually work and do
what they want, but didn't consider the computational cost on our end.
However, if we only allow 20 insource: queries at a time, they would probably
monopolize that capacity entirely. In my sample we got about 50,000 of these
queries in about an hour.
David/Chad, can you look at Nemo's issue and comment there on what's
plausible and what's not?
https://github.com/lagotto/lagotto/issues/405
Also, is this the kind of use case that we want to support? I'm not
suggesting that it isn't, I really don't know. But they aren't looking for
information, they are looking for something akin to impact factor on
reputable parts of the web. If that's not something we want to support, how
do we let them know? If that doesn't help—e.g., because it's some other
installation using their tool that's generating all the queries—do we block
it?
At the very least, we should ignore these malformed queries in our own
metrics.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Tue, Jul 28, 2015 at 10:18 AM, David Causse <dcausse(a)wikimedia.org>
wrote:
On 28/07/2015 at 15:09, Trey Jones wrote:
My first pass at data gathering at the end of the day yesterday was
slightly skewed, but the general trend still holds... getting Nemo's
suggested "insource:" queries into Lagotto will definitely cut down on the
number of zero-results searches we get (and actually do what they intend),
if the update gets pushed to their heavy users.
Beware that insource is the most expensive query type, and we allow only 20
insource queries to run concurrently (across all Wikipedia sites). I'm not
sure it's a good idea to expose this tool too widely.
There are several features like that (I mean syntax that's not available
in any other search engine with a large audience):
- wildcard queries (*)
- insource
- fuzzy searches
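For anyone who hasn't used them, these expert features look roughly like the
following. These are illustrative queries I've made up, not examples from the
thread, using the CirrusSearch syntax for wildcard, insource, and fuzzy
matching:

```
hast*                  wildcard: matches "haste", "hasty", "hastily", ...
insource:/citation/    insource: regex match against the raw wikitext
grammer~               fuzzy: tolerates small spelling differences
```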
While these features are very useful to "expert users", I think we should
not rely on such syntax to decrease the zero-result rate, because it won't
scale.
Another solution for this specific use case is to build a custom analyzer
that will extract this information from the content and expose a scalable
search field.
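As a sketch of what that could look like: CirrusSearch runs on Elasticsearch,
which lets you define a custom analyzer and apply it to a dedicated field at
index time. Assuming, purely for illustration, that the queries in question
are hunting for article identifiers such as DOIs in the wikitext, a
pattern_capture token filter could pull those identifiers into their own
indexed field, so that lookups become cheap term queries instead of expensive
insource: scans. The index name, field names, and regex here are all
hypothetical:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "doi_capture": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": ["(10\\.\\d{4,}/\\S+)"]
        }
      },
      "analyzer": {
        "doi_extractor": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "doi_capture"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "source_text": {
        "type": "text",
        "analyzer": "doi_extractor"
      }
    }
  }
}
```

With a field like this, a tool such as Lagotto could query it directly
instead of regex-scanning every document, which is what makes the approach
scalable.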
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search