Re: [discovery] Fwd: [Wikimedia-l] Improving search (sort of) - Discovery

16 Jul 2016

Hey James,

When we first started looking at zero results rate (ZRR), it was an easy
metric to calculate, and it was surprisingly high. We still look at ZRR
<https://searchdata.wmflabs.org/metrics/#failure_rate> because it is so
easy to measure, and anything that improves it is probably a net positive
(note the big dip when the new completion suggester was deployed!!), but we
have more complex metrics that we prefer. There's user engagement
<https://searchdata.wmflabs.org/metrics/#kpi_augmented_clickthroughs>/augmented
clickthroughs, which combines clicks and dwell time and other user
activity. We also use historical click data in a metric that improves when
we move clicked-on results higher in the results list, which we use with
the Relevance Forge
<https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/discovery/relevanceForge>
.

And I didn't mean to give the impression that *most* zero-results queries
are gibberish, though many, many are. And that was something we didn't
really know a year ago. There are also non-gibberish results that correctly
get zero results, like most DOI
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#DOI>
and many media player
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#TV_Episodes_.2F_Movies.E2.80.94.22....22_film>
queries. We also see a lot of non-notable (not-yet-notable?) public figures
(local bands, online artists, youtube musicians), and sometimes just random
names.

The discussion in response to Dan's original comment in Phab mentions some
approaches to reduce the risk of automatically releasing private info, but
I still take an absolute stand against unreviewed release. If I can get a
few hundred people to click on a link like this
<https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&fulltext=Search&search="James+is+a+nice+guy">,
I can get any message I want on that list. (Curious? Did you click?) The
message could be less anonymous and much more obnoxious, obviously.

50 character limits won't stop emails and phone numbers from making the
list (which invites spam and cranks). Those can be filtered, but not
perfectly.

I've only looked at these top lists by day in the past, but on that time
scale the top results are usually under 1000 count (and that includes IP
duplicates), so the list of queries with 100 IPs might also be very small.

As I said, I'm happy to do the data slogging to try this in a better
fashion if this task is prioritized, and I'd be happy to be wrong about the
quality of the results, but I'm still not hopeful.

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Fri, Jul 15, 2016 at 11:44 AM, James Heilman &lt;jmh649(a)gmail.com&gt; wrote:

...
  Hey Trey

 Thanks for the in depth discussion. So if the terms people are using that
 result in "zero search results" are typically gibberish why do we care if
 30% of our searches result in "zero search results"? A big deal was made
 about this a while ago.

 If one was just to look at those search terms that more than 100 IPs
 searched for would that not remove the concerns about anonymity? One could
 also limit the length of the searches displaced to 50 characters. And just
 provide the first 100 with an initial human review to make sure we are not
 miss anything.

 James

 On Fri, Jul 15, 2016 at 9:31 AM, Trey Jones &lt;tjones(a)wikimedia.org&gt; wrote:

  Pine, thanks for the forward. Regulars on the
Discovery list may know me,
 but James probably does not. I've manually reviewed tens of thousands of
 generally poorly performing queries (fewer than 3 results) and skimmed
 hundreds of thousands more from many of the top 20 Wikipedias—and to a
 lesser extent other projects—over the year I've been at the WMF and in
 Discovery. You can see my list of write ups here
 <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes>.

 So I want to say that this is an awesome idea—which is why many people
 have thought of it. It was apparently one of the first ideas the Discovery
 department had when they formed (see Dan's notes linked below). It was also
 one of the first ideas I had when I joined Discovery a few months later.

 Dan Garry's notes on T8373
 <https://phabricator.wikimedia.org/T8373#1856036> and the following
 discussion pretty much quash the idea of automated extraction and
 publication from a privacy perspective. People not only divulge their own
 personal information, they also divulge other people's personal
 information. One example: some guy outside the U.S. was methodically
 searching long lists of real addresses in Las Vegas. I will second Dan's
 comments in the T8373 discussion; all kinds of personal data end up in
 search queries. A dump of search queries *was* provided in September 2012

<https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/>,
 but had to be withdrawn over privacy concerns.

 Another concern for auto-published data: never underestimate the power of
 random groups of bored people on the internet. 4chan decided to arrange
 Time Magazine poll results

<https://techcrunch.com/2009/04/27/time-magazine-throws-up-its-hands-as-it-gets-pwned-by-4chan/>
so
 the first letter spelled out a weird message. It would be easy for 4chan,
 Reddit, and other communities to get any message they want on that list if
 they happened to notice that it existed. See also Boaty McBoatface
 <https://en.wikipedia.org/wiki/RRS_Sir_David_Attenborough#Name> and Mountain
 Dew "Diabeetus"

<https://storify.com/cbccommunity/hitler-did-nothing-wrong-wins-crowdsourced-mounta>
 (which is not at all the worst thing on *that* list). We don't want to
 have to try to defend against that.

 In my experience, the quality of what's actually there isn't that great.
 One of my first tasks when I joined Discovery was to look at daily lists of
 top 100 zero-results queries that had been gathered automatically. I was
 excited by this same idea. The top 100 zero-results query list was a
 wasteland. (Minimal notes on some of what I found are here

<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#Highly_repeated_searches>.)
 We could make it better by focusing on human-ish searchers, using basic
 bot-exclusion techniques

<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_frwiki_eswiki_itwiki_and_dewiki#Random_sampling>,
 ignoring duplicates from the same IP, and such, but I don't think it would
 help. And while Wikipedia is not for children, there could be an annoying
 amount of explicit adult material on the list, too. We would probably find
 some interesting spellings of Facebook and WhatsApp, though.

 If we're really excited about this, I could imagine using better
 techniques to pull zero-results queries and see if anything good is in
 there, but we'd have to commit to some sort of review before we publish it.
 For example, Discernatron <https://discernatron.wmflabs.org/> data,
 after consulting with legal, is reviewed independently by two people, who
 then have to reconcile any discrepancies, before being made public. So I
 think we'd need an ongoing commitment to have at least two people under NDA
 who would review any list before publication. 500-600 queries takes a
 couple hours per person (we’ve done that for the Discernatron), so the top
 100 would probably be less than an hour. I'd even be willing to help with
 the review (as I am for Discernatron) if we found there was something
 useful in there—but I'm not terribly hopeful. We'd also need more people to
 efficiently and effectively review queries for other languages if we wanted
 to extend this beyond English Wikipedia.

 Finally, if this is important enough and the task gets prioritized, I'd
 be willing to dive back in and go through the process once and pull out the
 top zero-results queries, this time with basic bot exclusion and IP
 deduplication—which we didn't do early on because we didn't realize what a
 mess the data was. We could process a week or a month of data and
 categorize the top 100 to 500 results in terms of personal info, junk,
 porn, and whatever other categories we want or that bubble up from the
 data, and perhaps publish the non-personal-info part of the list as an
 example, either to persuade ourselves that this is worth pursuing, or as a
 clearer counter to future calls to do so.
 —Trey

 Trey Jones
 Software Engineer, Discovery
 Wikimedia Foundation

 On Fri, Jul 15, 2016 at 10:09 AM, Pine W &lt;wiki.pine(a)gmail.com&gt; wrote:

  Forwarding

 Pine
 ---------- Forwarded message ----------
 From: "James Heilman" &lt;jmh649(a)gmail.com&gt;
 Date: Jul 15, 2016 06:33
 Subject: [Wikimedia-l] Improving search (sort of)
 To: "Wikimedia Mailing List" &lt;wikimedia-l(a)lists.wikimedia.org&gt;
 Cc:

 A while ago I requested a list of the "most frequently searched for terms
 for which no Wikipedia articles are returned". This would allow the
 community to than create redirect or new pages as appropriate and help
 address the "zero results rate" of about 30%.

 While we are still waiting for this data I have recently come across a
 list
 of the most frequently clicked on redlinks on En WP produced by Andrew
 West
 https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks Many
 of
 these can be reasonably addressed with a redirect as the issue is often
 capitals.

 Do anyone know where things are at with respect to producing the list of
 most search for terms that return nothing?

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 The Wikipedia Open Textbook of Medicine
 www.opentextbookofmedicine.com
 _______________________________________________
 Wikimedia-l mailing list, guidelines at:
 https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
 New messages to: Wikimedia-l(a)lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
 <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>

 _______________________________________________
 discovery mailing list
 discovery(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/discovery

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 The Wikipedia Open Textbook of Medicine
 www.opentextbookofmedicine.com