I use search to find typos and misused words, so I'm guilty of some of the gibberish-looking searches: https://en.wikipedia.org/wiki/User:WereSpielChequers/searches
If we are concerned that some common searches could have privacy implications, why not create the list as a deleted page and announce its (non)existence on the admins' noticeboard?
WSC
On 15 July 2016 at 19:25, wikimedia-l-request@lists.wikimedia.org wrote:
Today's Topics:
- Re: [discovery] Fwd: Improving search (sort of) (Dan Garry)
- Re: [discovery] Fwd: Improving search (sort of) (James Heilman)
- Re: [discovery] Fwd: Improving search (sort of) (James Heilman)
- Re: [discovery] Fwd: Improving search (sort of) (Robert Fernandez)
- Re: [discovery] Fwd: Improving search (sort of) (Nathan)
Message: 1
Date: Fri, 15 Jul 2016 09:05:54 -0700
From: Dan Garry dgarry@wikimedia.org
To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org
Cc: A public mailing list about Wikimedia Search and Discovery projects discovery@lists.wikimedia.org, Trey Jones tjones@wikimedia.org
Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
Message-ID: <CAOW03MHsgowW-gAd6uDJs_ONvA8ZNiUyKcCrP2evOK1B+2DOZA@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On 15 July 2016 at 08:44, James Heilman jmh649@gmail.com wrote:
Thanks for the in-depth discussion. So if the terms people are using that result in "zero search results" are typically gibberish, why do we care if 30% of our searches result in "zero search results"? A big deal was made about this a while ago.
Good question! I used to say that it was my aspiration that users should never get zero results when searching Wikipedia. As a result of Trey's analysis, I don't say that any more. ;-) There are many legitimate cases where users should get zero results. However, there are still plenty of cases where giving users zero results is incorrect; "jurrasic world" was a prominent example of that.
It's still not quite right to say that *all* the terms that people use to get zero results are gibberish. There is an extremely long tail https://en.wikipedia.org/wiki/Long_tail of zero results queries that aren't gibberish; it's just that the top 100 are dominated by gibberish. This would mean we'd have to release many, many more than the top 100, which significantly increases the risk of releasing personal information.
If one was just to look at those search terms that more than 100 IPs searched for, would that not remove the concerns about anonymity? One could also limit the length of the searches displayed to 50 characters. And just provide the first 100 with an initial human review to make sure we are not missing anything.
The problem with this is that there are still no guarantees. What if you saw the search query "DF198671E"? You might not think anything of it, but I would recognise it as an example of a national insurance number https://en.wikipedia.org/wiki/National_Insurance_number, the British equivalent of a social security number [1]. There's always going to be the potential that we accidentally release something sensitive when we release arbitrary user input, even if it's manually examined by humans.
So, in summary:
- The top 100 zero results queries are dominated by gibberish.
- There's a long tail of zero results queries, meaning we'd have to release many more than the top 100.
- Manually examining the top zero results queries is not a foolproof way of eliminating personal data since it's arbitrary user input.
I'm happy to answer any questions. :-)
Thanks, Dan
[1]: Don't panic, this example national insurance number is actually invalid. ;-)
-- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation
Message: 2
Date: Fri, 15 Jul 2016 10:19:08 -0600
From: James Heilman jmh649@gmail.com
To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org
Cc: A public mailing list about Wikimedia Search and Discovery projects discovery@lists.wikimedia.org, Trey Jones tjones@wikimedia.org
Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
Message-ID: <CAF1en7WBrxDJ_H3J=eN5NZmGueQEZ+txOGAG5u4af3FwTVV55Q@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
The "jurrasic world" example is a good one as it was "fixed" by User:Foxj adding a redirect https://en.wikipedia.org/w/index.php?title=Jurrasic_world&action=history
Agree we would need to be careful. The chance of many different IPs all searching for "DF198671E" is low, but I agree it is not zero, and we would need to have people review the results before they are displayed.
I guess the question is how much work would it take to look at this sort of data for more examples like "jurrasic world"?
James
-- James Heilman MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine www.opentextbookofmedicine.com
Message: 3
Date: Fri, 15 Jul 2016 10:25:54 -0600
From: James Heilman jmh649@gmail.com
To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org
Cc: A public mailing list about Wikimedia Search and Discovery projects discovery@lists.wikimedia.org, Trey Jones tjones@wikimedia.org
Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
Message-ID: <CAF1en7VYkakrzZf6bMcCtv1dBj2NROSnY1Gv8BwyOEkg+yiTSw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Forwarded at the request of Trey Jones
Hey James,
When we first started looking at zero results rate (ZRR), it was an easy metric to calculate, and it was surprisingly high. We still look at ZRR https://searchdata.wmflabs.org/metrics/#failure_rate because it is so easy to measure, and anything that improves it is probably a net positive (note the big dip when the new completion suggester was deployed!!), but we have more complex metrics that we prefer. There's user engagement <https://searchdata.wmflabs.org/metrics/#kpi_augmented_clickthroughs>/augmented clickthroughs, which combines clicks and dwell time and other user activity. We also use historical click data in a metric that improves when we move clicked-on results higher in the results list, which we use with the Relevance Forge <https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/discovery/relevanc...>.
And I didn't mean to give the impression that *most* zero-results queries are gibberish, though many, many are. And that was something we didn't really know a year ago. There are also non-gibberish queries that correctly get zero results, like most DOI <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Result...> and many media player <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Result...> queries. We also see a lot of non-notable (not-yet-notable?) public figures (local bands, online artists, YouTube musicians), and sometimes just random names.
The discussion in response to Dan's original comment in Phab mentions some approaches to reduce the risk of automatically releasing private info, but I still take an absolute stand against unreviewed release. If I can get a few hundred people to click on a link like this <https://en.wikipedia.org/w/index.php?title=Special:Search&profile=defaul...>, I can get any message I want on that list. (Curious? Did you click?) The message could be less anonymous and much more obnoxious, obviously.
50 character limits won't stop emails and phone numbers from making the list (which invites spam and cranks). Those can be filtered, but not perfectly.
I've only looked at these top lists by day in the past, but on that time scale the top results are usually under 1000 count (and that includes IP duplicates), so the list of queries with 100 IPs might also be very small.
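To make the proposals in this thread concrete, here is a minimal Python sketch of the automatic filtering step: keep only queries searched from more than 100 distinct IPs, cap the length at 50 characters, and drop anything that looks like an email address or phone number. Everything in it is an assumption for illustration (the function name, the `(query, ip)` input shape, the thresholds as constants, and the deliberately crude regular expressions); it is not how Discovery processes its logs, and its output would still need human review.

```python
import re
from collections import defaultdict

# Hypothetical thresholds, taken from the proposal discussed in this thread.
IP_THRESHOLD = 100       # keep queries seen from MORE than this many distinct IPs
MAX_QUERY_LENGTH = 50    # drop queries longer than this many characters

# Crude, illustrative patterns only; real emails and phone numbers vary far more.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def candidate_queries(log_rows):
    """Return (query, distinct_ip_count) pairs that pass the automatic filters.

    log_rows is an iterable of (query, ip) pairs for zero-result searches.
    Queries are normalised by trimming whitespace and lowercasing.
    """
    ips_by_query = defaultdict(set)
    for query, ip in log_rows:
        ips_by_query[query.strip().lower()].add(ip)

    kept = []
    for query, ips in ips_by_query.items():
        if len(ips) <= IP_THRESHOLD:
            continue  # too few distinct IPs
        if len(query) > MAX_QUERY_LENGTH:
            continue  # too long
        if EMAIL_RE.search(query) or PHONE_RE.search(query):
            continue  # looks like contact details
        kept.append((query, len(ips)))

    # Most widely searched first, ties broken alphabetically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept
```

Note that a string like "DF198671E" passes every one of these checks if enough IPs search for it, which illustrates the point made earlier in the thread that such filters give no guarantees.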
As I said, I'm happy to do the data slogging to try this in a better fashion if this task is prioritized, and I'd be happy to be wrong about the quality of the results, but I'm still not hopeful.
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
Message: 4
Date: Fri, 15 Jul 2016 14:15:31 -0400
From: Robert Fernandez wikigamaliel@gmail.com
To: Wikimedia Mailing List wikimedia-l@lists.wikimedia.org
Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
Message-ID: <CAMY8yAWisp507c_F3hJbcRWT20NZjyczN0oiZWPZ-UwggRi38Q@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
If I can get a few hundred people to click on a link like this <https://en.wikipedia.org/w/index.php?title=Special:Search&profile=defaul...>, I can get any message I want on that list. (Curious? Did you click?) The message could be less anonymous and much more obnoxious, obviously.
They could vandalize any one of over ten million pages on the English Wikipedia and get the same result. We should be conscious of the dangers but we can easily route around them like we do with other kinds of vandalism.
Message: 5
Date: Fri, 15 Jul 2016 14:25:08 -0400
From: Nathan nawrich@gmail.com
To: wikigamaliel@gmail.com, Wikimedia Mailing List wikimedia-l@lists.wikimedia.org
Subject: Re: [Wikimedia-l] [discovery] Fwd: Improving search (sort of)
Message-ID: <CALKX9dTwh=BDVPFtiT6tGw53XRccC8TbyZd2kJ9benKx18Jj5w@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
How hard would it be to ask for feedback on search results, perhaps piloting with some small subset of zero-result searches? For 1 in 1,000 zero-result searches, prompt the user to provide some type of useful information about why there should be results, or whether there ought to be, or what category of information the searcher was looking for, etc. You'd get junk and noise, but it might be one way to filter out a lot of the gibberish. You could also ask people to agree to make their failed search part of a publicly visible list, although this could of course be gamed.
Subject: Digest Footer
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines New messages to: Wikimedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
End of Wikimedia-l Digest, Vol 148, Issue 26
wikimedia-l@lists.wikimedia.org