Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)

15 Jul 2016

      Forwarded at the request of Trey Jones
Hey James,
When we first started looking at zero results rate (ZRR), it was an easy
metric to calculate, and it was surprisingly high. We still look at ZRR
https://searchdata.wmflabs.org/metrics/#failure_rate because it is so
easy to measure, and anything that improves it is probably a net positive
(note the big dip when the new completion suggester was deployed!!), but we
have more complex metrics that we prefer. There's user
engagement/augmented clickthroughs
https://searchdata.wmflabs.org/metrics/#kpi_augmented_clickthroughs,
which combines clicks, dwell time, and other user activity. We also use
historical click data in a metric that improves when we move clicked-on
results higher in the results list, which we use with the Relevance Forge
https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/discovery/relevanceForge.
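For readers who want the arithmetic behind ZRR spelled out, here is a minimal sketch. It assumes ZRR is simply the share of searches that returned no results; the function name and the input shape (one result count per search) are hypothetical and not taken from the Discovery dashboards.

```python
def zero_results_rate(result_counts):
    """Return the fraction of searches that returned zero results.

    `result_counts` is a hypothetical input: one integer per search,
    giving how many results that search returned.
    """
    counts = list(result_counts)
    if not counts:
        return 0.0  # no searches logged, so there is no rate to report
    zero = sum(1 for n in counts if n == 0)
    return zero / len(counts)
```

For example, four searches of which two returned nothing would give a rate of 0.5.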
And I didn't mean to give the impression that *most* zero-results queries
are gibberish, though many, many are. And that was something we didn't
really know a year ago. There are also non-gibberish queries that correctly
get zero results, like most DOI
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#DOI
and many media player
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#TV_Episodes_.2F_Movies.E2.80.94.22....22_film
queries.
We also see a lot of non-notable (not-yet-notable?) public figures (local
bands, online artists, youtube musicians), and sometimes just random names.
The discussion in response to Dan's original comment in Phab mentions some
approaches to reduce the risk of automatically releasing private info, but
I still take an absolute stand against unreviewed release. If I can get a
few hundred people to click on a link like this
https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&fulltext=Search&search=%22James+is+a+nice+guy%22,
I can get any message I want on that list. (Curious? Did you click?) The
message could be less anonymous and much more obnoxious, obviously.
50 character limits won't stop emails and phone numbers from making the
list (which invites spam and cranks). Those can be filtered, but not
perfectly.
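To make the "not perfectly" point concrete, here is a rough sketch of a length-and-pattern filter. The regular expressions, the helper names, and the sample queries are all assumptions for illustration; real email and phone formats vary far more than these patterns cover.

```python
import re

# Hypothetical patterns, deliberately simple.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Seven to fifteen digits, optionally separated by spaces or punctuation.
PHONE_RE = re.compile(r"(?:\+?\d[\s().-]*){7,15}")

MAX_LEN = 50  # the character limit suggested in the thread


def looks_sensitive(query):
    """True if the query contains something shaped like an email or phone number."""
    return bool(EMAIL_RE.search(query) or PHONE_RE.search(query))


def filter_queries(queries):
    """Keep queries within the length limit that match neither pattern."""
    return [q for q in queries if len(q) <= MAX_LEN and not looks_sensitive(q)]
```

A short identifier such as "DF198671E" passes both checks, which is exactly the kind of gap that pattern filtering leaves open.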
I've only looked at these top lists by day in the past, but on that time
scale the top results are usually under 1000 count (and that includes IP
duplicates), so the list of queries with 100 IPs might also be very small.
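The IP threshold could be checked with something like the sketch below. The input format (pairs of query and IP) and the function name are hypothetical, and "at least 100" is one reading of the proposed cutoff. Counting distinct IPs rather than raw hits is what removes the IP duplicates mentioned above.

```python
from collections import defaultdict


def queries_over_ip_threshold(log_rows, min_ips=100):
    """Map each query to its distinct-IP count, keeping those at or above min_ips.

    `log_rows` is a hypothetical iterable of (query, ip) pairs.
    """
    ips_by_query = defaultdict(set)
    for query, ip in log_rows:
        ips_by_query[query].add(ip)  # a set ignores repeat hits from one IP
    return {
        query: len(ips)
        for query, ips in ips_by_query.items()
        if len(ips) >= min_ips
    }
```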
As I said, I'm happy to do the data slogging to try this in a better
fashion if this task is prioritized, and I'd be happy to be wrong about the
quality of the results, but I'm still not hopeful.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Fri, Jul 15, 2016 at 10:19 AM, James Heilman jmh649@gmail.com wrote:
...
The "jurrasic world" example is a good one as it was "fixed" by User:Foxj
adding a redirect
https://en.wikipedia.org/w/index.php?title=Jurrasic_world&action=history
Agree we would need to be careful. The chance of many different IPs all
searching for "DF198671E" is low, but I agree it is not zero, and we would
need to have people review the results before they are displayed.
I guess the question is how much work would it take to look at this sort
of data for more examples like "jurrasic world"?
James
On Fri, Jul 15, 2016 at 10:05 AM, Dan Garry dgarry@wikimedia.org wrote:
...
On 15 July 2016 at 08:44, James Heilman jmh649@gmail.com wrote:
...
Thanks for the in-depth discussion. So if the terms people are using that
result in "zero search results" are typically gibberish, why do we care if
30% of our searches result in "zero search results"? A big deal was made
about this a while ago.
Good question! I originally used to say that it was my aspiration that
users should never get zero results when searching Wikipedia. As a result
of Trey's analysis, I don't say that any more. ;-) There are many
legitimate cases where users should get zero results. However, there are
still tons of examples of where giving users zero results is incorrect;
"jurrasic world" was a prominent example of that.
It's still not quite right to say that *all* the terms that people use to
get zero results are gibberish. There is an extremely long tail
https://en.wikipedia.org/wiki/Long_tail of zero results queries that
aren't gibberish; it's just that the top 100 are dominated by gibberish.
This would mean we'd have to release many, many more than the top 100,
which significantly increases the risk of releasing personal information.
...
If one was just to look at those search terms that more than 100 IPs
searched for, would that not remove the concerns about anonymity? One could
also limit the length of the searches displayed to 50 characters. And just
provide the first 100 with an initial human review to make sure we are not
missing anything.
The problem with this is that there are still no guarantees. What if you
saw the search query "DF198671E"? You might not think anything of it, but I
would recognise it as an example of a national insurance number
https://en.wikipedia.org/wiki/National_Insurance_number, the British
equivalent of a social security number [1]. There's always going to be the
potential that we accidentally release something sensitive when we release
arbitrary user input, even if it's manually examined by humans.
So, in summary:

- The top 100 zero results queries are dominated by gibberish.
- There's a long tail of zero results queries, meaning we'd have to
  release many more than the top 100.
- Manually examining the top zero results queries is not a foolproof way
  of eliminating personal data since it's arbitrary user input.
I'm happy to answer any questions. :-)
Thanks,
Dan
[1]: Don't panic, this example national insurance number is actually
invalid. ;-)
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
--
James Heilman
MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com