Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of) - Wikimedia-l

15 Jul 2016

I use search to find typos and misused words, so I'm guilty of some of the
gibberish-looking searches
<https://en.wikipedia.org/wiki/User:WereSpielChequers/searches>.

If we are concerned that some common searches could have privacy
implications, why not create the list as a deleted page and announce its
(non)existence on the admins' noticeboard?

WSC

On 15 July 2016 at 19:25, <wikimedia-l-request(a)lists.wikimedia.org> wrote:

...

 Today's Topics:

    1. Re: [discovery] Fwd: Improving search (sort of) (Dan Garry)
    2. Re: [discovery] Fwd: Improving search (sort of) (James Heilman)
    3. Re: [discovery] Fwd: Improving search (sort of) (James Heilman)
    4. Re: [discovery] Fwd: Improving search (sort of) (Robert Fernandez)
    5. Re: [discovery] Fwd: Improving search (sort of) (Nathan)

 ----------------------------------------------------------------------

 Message: 1
 Date: Fri, 15 Jul 2016 09:05:54 -0700
 From: Dan Garry <dgarry(a)wikimedia.org>
 To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
 Cc: A public mailing list about Wikimedia Search and Discovery
         projects <discovery(a)lists.wikimedia.org>, Trey Jones
         <tjones(a)wikimedia.org>
 Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
 Message-ID:
         <CAOW03MHsgowW-gAd6uDJs_ONvA8ZNiUyKcCrP2evOK1B+2DOZA(a)mail.gmail.com>
 Content-Type: text/plain; charset=UTF-8

 On 15 July 2016 at 08:44, James Heilman <jmh649(a)gmail.com> wrote:

 > Thanks for the in depth discussion. So if the terms people are using that
 > result in "zero search results" are typically gibberish why do we care if
 > 30% of our searches result in "zero search results"? A big deal was made
 > about this a while ago.

 Good question! I originally used to say that it was my aspiration that
 users should never get zero results when searching Wikipedia. As a result
 of Trey's analysis, I don't say that any more. ;-) There are many
 legitimate cases where users should get zero results. However, there are
 still tons of examples of where giving users zero results is incorrect;
 "jurrasic world" was a prominent example of that.

 It's still not quite right to say that *all* the terms that people use to
 get zero results are gibberish. There is an extremely long tail
 <https://en.wikipedia.org/wiki/Long_tail> of zero results queries that
 aren't gibberish, it's just that the top 100 are dominated by gibberish.
 This would mean we'd have to release many, many more than the top 100,
 which significantly increases the risk of releasing personal information.

 > If one was just to look at those search terms that more than 100 IPs
 > searched for, would that not remove the concerns about anonymity? One could
 > also limit the length of the searches displayed to 50 characters. And just
 > provide the first 100 with an initial human review to make sure we are not
 > missing anything.
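
A minimal sketch of the filter proposed above, assuming a hypothetical log of (ip, query) pairs for zero-result searches. The function name, the log format, and the parameter names are illustrative and do not come from any Wikimedia tooling. It keeps queries of at most 50 characters that were searched from more than 100 distinct IPs and returns the top 100 for human review.

```python
from collections import defaultdict


def candidate_queries(log, min_ips=100, max_len=50, top_n=100):
    """Return up to top_n (query, distinct_ip_count) pairs for human review.

    log is an iterable of (ip, query) pairs; this format is an assumption
    made for illustration only.
    """
    ips_by_query = defaultdict(set)
    for ip, query in log:
        # Drop overly long queries before counting anything.
        if len(query) <= max_len:
            ips_by_query[query].add(ip)

    # Count distinct IPs, so repeated searches from one IP do not inflate a query.
    counts = {q: len(ips) for q, ips in ips_by_query.items() if len(ips) > min_ips}

    # Most widely searched first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

The output is only a shortlist: anything that clears the threshold would still need the human review mentioned above before publication.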

 The problem with this is that there are still no guarantees. What if you
 saw the search query "DF198671E"? You might not think anything of it, but I
 would recognise it as an example of a national insurance number
 <https://en.wikipedia.org/wiki/National_Insurance_number>, the British
 equivalent of a social security number [1]. There's always going to be the
 potential that we accidentally release something sensitive when we release
 arbitrary user input, even if it's manually examined by humans.

 So, in summary:

    - The top 100 zero results queries are dominated by gibberish.
    - There's a long tail of zero results queries, meaning we'd have to
    release many more than the top 100.
    - Manually examining the top zero results queries is not a foolproof way
    of eliminating personal data since it's arbitrary user input.

 I'm happy to answer any questions. :-)

 Thanks,
 Dan

 [1]: Don't panic, this example national insurance number is actually
 invalid. ;-)

 --
 Dan Garry
 Lead Product Manager, Discovery
 Wikimedia Foundation

 ------------------------------

 Message: 2
 Date: Fri, 15 Jul 2016 10:19:08 -0600
 From: James Heilman <jmh649(a)gmail.com>
 To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
 Cc: A public mailing list about Wikimedia Search and Discovery
         projects <discovery(a)lists.wikimedia.org>, Trey Jones
         <tjones(a)wikimedia.org>
 Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
 Message-ID:
         <CAF1en7WBrxDJ_H3J=eN5NZmGueQEZ+txOGAG5u4af3FwTVV55Q(a)mail.gmail.com>
 Content-Type: text/plain; charset=UTF-8

 The "jurrasic world" example is a good one as it was "fixed" by User:Foxj
 adding a redirect:
 https://en.wikipedia.org/w/index.php?title=Jurrasic_world&action=history

 Agreed, we would need to be careful. The chance of many different IPs all
 searching for "DF198671E" is low, but I agree it is not zero, and we would
 need to have people review the results before they are displayed.

 I guess the question is how much work would it take to look at this sort of
 data for more examples like "jurrasic world"?

 James

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 The Wikipedia Open Textbook of Medicine
 www.opentextbookofmedicine.com

 ------------------------------

 Message: 3
 Date: Fri, 15 Jul 2016 10:25:54 -0600
 From: James Heilman <jmh649(a)gmail.com>
 To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
 Cc: A public mailing list about Wikimedia Search and Discovery
         projects <discovery(a)lists.wikimedia.org>, Trey Jones
         <tjones(a)wikimedia.org>
 Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
 Message-ID:
         <CAF1en7VYkakrzZf6bMcCtv1dBj2NROSnY1Gv8BwyOEkg+yiTSw(a)mail.gmail.com>
 Content-Type: text/plain; charset=UTF-8

 Forwarded at the request of Trey Jones

 Hey James,

 When we first started looking at zero results rate (ZRR), it was an easy
 metric to calculate, and it was surprisingly high. We still look at ZRR
 <https://searchdata.wmflabs.org/metrics/#failure_rate> because it is so
 easy to measure, and anything that improves it is probably a net positive
 (note the big dip when the new completion suggester was deployed!!), but we
 have more complex metrics that we prefer. There's user engagement
 <https://searchdata.wmflabs.org/metrics/#kpi_augmented_clickthroughs>/augmented
 clickthroughs, which combines clicks and dwell time and other user
 activity. We also use historical click data in a metric that improves when
 we move clicked-on results higher in the results list, which we use with
 the Relevance Forge
 <https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/discovery/relevan…>.
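
For reference, the zero results rate is simply the share of searches that return nothing. A minimal sketch, assuming a hypothetical list of per-search result counts (the function name and input format are illustrative, not the Discovery team's actual pipeline):

```python
def zero_results_rate(result_counts):
    """Fraction of searches that returned zero results.

    result_counts is an iterable with one integer per search: the number
    of results that search returned. This input format is an assumption.
    """
    counts = list(result_counts)
    if not counts:
        # No searches recorded; report 0.0 rather than dividing by zero.
        return 0.0
    zero = sum(1 for n in counts if n == 0)
    return zero / len(counts)
```

With three empty result sets out of ten searches, this gives 0.3, which is the scale of the 30% figure discussed in this thread.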

 And I didn't mean to give the impression that *most* zero-results queries
 are gibberish, though many, many are. And that was something we didn't
 really know a year ago. There are also non-gibberish queries that correctly
 get zero results, like most DOI
 <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul…>
 and many media player
 <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul…>
 queries.
 We also see a lot of non-notable (not-yet-notable?) public figures (local
 bands, online artists, youtube musicians), and sometimes just random names.

 The discussion in response to Dan's original comment in Phab mentions some
 approaches to reduce the risk of automatically releasing private info, but
 I still take an absolute stand against unreviewed release. If I can get a
 few hundred people to click on a link like this
 <https://en.wikipedia.org/w/index.php?title=Special:Search&profile=defau…>,
 I can get any message I want on that list. (Curious? Did you click?) The
 message could be less anonymous and much more obnoxious, obviously.

 50 character limits won't stop emails and phone numbers from making the
 list (which invites spam and cranks). Those can be filtered, but not
 perfectly.
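
A minimal sketch of that kind of filter, assuming two deliberately simple regular expressions. Both patterns and the function name are illustrative assumptions; they are not what any Wikimedia code uses, and they show why such filtering is imperfect.

```python
import re

# Rough patterns, chosen for illustration only; real-world email and phone
# formats vary far more than these expressions cover.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def looks_sensitive(query):
    """Return True if the query appears to contain an email address or phone number."""
    return bool(EMAIL_RE.search(query) or PHONE_RE.search(query))
```

A query such as "DF198671E" passes straight through this check, so identifiers in formats nobody anticipated would still need a human reviewer to catch them.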

 I've only looked at these top lists by day in the past, but on that time
 scale the top results are usually under 1000 count (and that includes IP
 duplicates), so the list of queries with 100 IPs might also be very small.

 As I said, I'm happy to do the data slogging to try this in a better
 fashion if this task is prioritized, and I'd be happy to be wrong about the
 quality of the results, but I'm still not hopeful.

 —Trey

 Trey Jones
 Software Engineer, Discovery
 Wikimedia Foundation

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 The Wikipedia Open Textbook of Medicine
 www.opentextbookofmedicine.com

 ------------------------------

 Message: 4
 Date: Fri, 15 Jul 2016 14:15:31 -0400
 From: Robert Fernandez <wikigamaliel(a)gmail.com>
 To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
 Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
 Message-ID:
         <CAMY8yAWisp507c_F3hJbcRWT20NZjyczN0oiZWPZ-UwggRi38Q(a)mail.gmail.com>
 Content-Type: text/plain; charset=UTF-8

 > If I can get a few hundred people to click on a link like this
 > <https://en.wikipedia.org/w/index.php?title=Special:Search&profile=defau…>,
 > I can get any message I want on that list. (Curious? Did you click?) The
 > message could be less anonymous and much more obnoxious, obviously.

 They could vandalize any one of over ten million pages on the English
 Wikipedia and get the same result.  We should be conscious of the
 dangers but we can easily route around them like we do with other
 kinds of vandalism.

 ------------------------------

 Message: 5
 Date: Fri, 15 Jul 2016 14:25:08 -0400
 From: Nathan <nawrich(a)gmail.com>
 To: wikigamaliel(a)gmail.com, Wikimedia Mailing List
         <wikimedia-l(a)lists.wikimedia.org>
 Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
 Message-ID:
         <CALKX9dTwh=BDVPFtiT6tGw53XRccC8TbyZd2kJ9benKx18Jj5w(a)mail.gmail.com>
 Content-Type: text/plain; charset=UTF-8

 How hard would it be to ask users for feedback on search results, perhaps
 piloting with some small subset of zero-result searches? For 1 in 1,000
 zero-result searches, prompt the user to provide some useful information
 about why there should be results, or whether there ought to be, or what
 category of information the searcher was looking for, etc. You'd get junk
 and noise, but it might
 be one way to filter out a lot of the gibberish. You could also ask people
 to agree to make their failed search part of a publicly visible list,
 although this could of course be gamed.
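
A minimal sketch of the 1-in-1,000 sampling idea, assuming a hypothetical hook that runs whenever a search returns zero results. The function name and the injected random generator are illustrative assumptions, not an existing MediaWiki interface.

```python
import random


def should_prompt(rng, rate=1 / 1000):
    """Decide whether to show the feedback prompt for one zero-result search.

    rng is a random.Random instance, passed in so the behaviour can be
    reproduced in tests; rate is the fraction of searches to sample.
    """
    return rng.random() < rate
```

Sampling per search rather than per user keeps the sketch stateless, though a real deployment might prefer to avoid prompting the same person repeatedly.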

 ------------------------------

 Subject: Digest Footer

 _______________________________________________
 Wikimedia-l mailing list,  guidelines at:
 https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
 New messages to: Wikimedia-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikimedia-l

 ------------------------------

 End of Wikimedia-l Digest, Vol 148, Issue 26
 ********************************************