New subject: Fwd: [discovery] Fwd: Improving search (sort of)

30 Jul 2016

Forwarding to the Wikimedia mailing list; I'm sorry for the lateness!

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation

---------- Forwarded message ----------
From: Trey Jones <tjones(a)wikimedia.org>
Date: Mon, Jul 25, 2016 at 11:58 AM
Subject: Re: [discovery] Fwd: [Wikimedia-l] Improving search (sort of)
To: A public mailing list about Wikimedia Search and Discovery projects
<discovery(a)lists.wikimedia.org>
Cc: James Heilman <jmh649(a)gmail.com>

I decided to look into this as my 10% project last week. It ended up being
a 15% project, but I wanted to finish it up.

I carefully reviewed and categorized the top 100 "unsuccessful" (i.e.,
zero-results) queries from May 2016, skimmed the top 1,000 from May, and
skimmed and compared the top 100 and top 1,000 for June.

The top result (with several variants in the top 100) is a porn site that
has had a wiki page created and deleted several times. Various websites
round out the top 10. Internet personalities and websites dominate the top
100 and several have had pages created and deleted over the years. There's
strong evidence of links being used for some queries—though I didn't try to
track them down. There's plenty of personally identifiable information in
the top 1,000 most frequent queries. More than 10% of the queries (by
volume) get good results from the completion suggester or "did you mean"
spelling suggestions, and more than 10% have some results approximately two
months later (i.e., late last week).

Obvious refinements to the search strategy would eliminate so many
high-frequency queries that any useful mining would be down to slogging
through the low-impact long tail.

I don’t think there’s a lot here worth extracting, though others may
disagree. The privacy concerns expressed earlier are genuine, and simple
attempts to filter PII (using patterns, minimum IP counts, etc.) are not
guaranteed to be effective.
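
For illustration only, here is a minimal sketch of what "simple attempts"
at filtering might look like, assuming a log of (query, ip) pairs. The
patterns, the function names, and the min_ips threshold are all
hypothetical and are not taken from the analysis described above.

```python
import re
from collections import defaultdict

# Hypothetical patterns; real PII takes far more forms than these two.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),      # email-like strings
    re.compile(r"\d{3}[\s.-]?\d{3}[\s.-]?\d{4}"),  # US-style phone numbers
]


def looks_like_pii(query):
    """Return True if the query matches any of the known PII patterns."""
    return any(p.search(query) for p in PII_PATTERNS)


def filter_queries(log, min_ips=10):
    """Keep queries seen from at least min_ips distinct IPs and matching no pattern.

    log is an iterable of (query, ip) pairs (an assumed format).
    """
    ips_by_query = defaultdict(set)
    for query, ip in log:
        # Normalize lightly so trivial variants are counted together.
        ips_by_query[query.strip().lower()].add(ip)
    return sorted(
        q for q, ips in ips_by_query.items()
        if len(ips) >= min_ips and not looks_like_pii(q)
    )
```

A person's full name searched from enough distinct IPs matches neither
pattern and clears the threshold, so it passes straight through. That is
one concrete way in which filters of this kind fall short.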

For lots more details (but no actual queries), see here:

https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Sea…

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Fri, Jul 15, 2016 at 11:31 AM, Trey Jones <tjones(a)wikimedia.org> wrote:

...
 Finally, if this is important enough and the task gets prioritized, I'd be
 willing to dive back in and go through the process once and pull out the
 top zero-results queries, this time with basic bot exclusion and IP
 deduplication—which we didn't do early on because we didn't realize what a
 mess the data was. We could process a week or a month of data and
 categorize the top 100 to 500 results in terms of personal info, junk,
 porn, and whatever other categories we want or that bubble up from the
 data, and perhaps publish the non-personal-info part of the list as an
 example, either to persuade ourselves that this is worth pursuing, or as a
 clearer counter to future calls to do so.
 —Trey
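
As a rough sketch of the process described above (basic bot exclusion, IP
deduplication, then ranking the zero-results queries), assuming events
arrive as (ip, query, hit_count) tuples. The event format, the function
name, and the max_per_ip cutoff used as a crude bot heuristic are
assumptions for illustration, not the Discovery team's actual pipeline.

```python
from collections import Counter


def top_zero_result_queries(events, n=100, max_per_ip=1000):
    """Rank zero-results queries by the number of distinct IPs issuing them.

    events: iterable of (ip, query, hit_count) tuples (assumed format).
    IPs with more than max_per_ip events are treated as suspected bots.
    """
    events = list(events)

    # Crude bot exclusion: drop every event from very high-volume IPs.
    per_ip = Counter(ip for ip, _, _ in events)
    suspected_bots = {ip for ip, count in per_ip.items() if count > max_per_ip}

    seen = set()        # (ip, normalized query) pairs already counted
    counts = Counter()  # normalized query -> number of distinct IPs
    for ip, query, hits in events:
        if hits != 0 or ip in suspected_bots:
            continue
        key = (ip, query.strip().lower())
        if key in seen:
            continue  # IP deduplication: one vote per IP per query
        seen.add(key)
        counts[key[1]] += 1
    return counts.most_common(n)
```

The resulting list would still need manual categorization (personal info,
junk, porn, and so on) before any part of it could be published.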

...
  ---------- Forwarded message ----------
 From: "James Heilman" <jmh649(a)gmail.com>
 Date: Jul 15, 2016 06:33
 Subject: [Wikimedia-l] Improving search (sort of)
 To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
 Cc:

 A while ago I requested a list of the "most frequently searched for terms
 for which no Wikipedia articles are returned". This would allow the
 community to then create redirects or new pages as appropriate and help
 address the "zero results rate" of about 30%.

 While we are still waiting for this data, I have recently come across a
 list of the most frequently clicked-on redlinks on En WP, produced by
 Andrew West:
 https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks
 Many of these can be reasonably addressed with a redirect, as the issue is
 often capitalization.

 Does anyone know where things are at with respect to producing the list of
 most searched-for terms that return nothing?

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 _______________________________________________
discovery mailing list
discovery(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery