Fwd: [Wikimedia-l] Improving search (sort of)


Pine W

15 Jul 2016

10:09 p.m.

Forwarding

Pine

---------- Forwarded message ----------
From: "James Heilman" jmh649@gmail.com
Date: Jul 15, 2016 06:33
Subject: [Wikimedia-l] Improving search (sort of)
To: "Wikimedia Mailing List" wikimedia-l@lists.wikimedia.org
Cc:

A while ago I requested a list of the "most frequently searched for terms for which no Wikipedia articles are returned". This would allow the community to then create redirects or new pages as appropriate and help address the "zero results rate" of about 30%.

While we are still waiting for this data, I have recently come across a list of the most frequently clicked-on redlinks on En WP, produced by Andrew West (https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks). Many of these can reasonably be addressed with a redirect, as the issue is often capitalization.

Does anyone know where things are at with respect to producing the list of most searched-for terms that return nothing?

-- James Heilman MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
_______________________________________________
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
New messages to: Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe


Trey Jones

15 Jul 2016

11:31 p.m.

Pine, thanks for the forward. Regulars on the Discovery list may know me, but James probably does not. I've manually reviewed tens of thousands of generally poorly performing queries (fewer than 3 results) and skimmed hundreds of thousands more from many of the top 20 Wikipedias—and to a lesser extent other projects—over the year I've been at the WMF and in Discovery. You can see my list of write ups here https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes.

So I want to say that this is an awesome idea—which is why many people have thought of it. It was apparently one of the first ideas the Discovery department had when they formed (see Dan's notes linked below). It was also one of the first ideas I had when I joined Discovery a few months later.

Dan Garry's notes on T8373 https://phabricator.wikimedia.org/T8373#1856036 and the following discussion pretty much quash the idea of automated extraction and publication from a privacy perspective. People not only divulge their own personal information, they also divulge other people's personal information. One example: some guy outside the U.S. was methodically searching long lists of real addresses in Las Vegas. I will second Dan's comments in the T8373 discussion; all kinds of personal data end up in search queries. A dump of search queries *was* provided in September 2012 https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/, but had to be withdrawn over privacy concerns.

Another concern for auto-published data: never underestimate the power of random groups of bored people on the internet. 4chan decided to arrange Time Magazine poll results https://techcrunch.com/2009/04/27/time-magazine-throws-up-its-hands-as-it-gets-pwned-by-4chan/ so the first letter spelled out a weird message. It would be easy for 4chan, Reddit, and other communities to get any message they want on that list if they happened to notice that it existed. See also Boaty McBoatface https://en.wikipedia.org/wiki/RRS_Sir_David_Attenborough#Name and Mountain Dew "Diabeetus" https://storify.com/cbccommunity/hitler-did-nothing-wrong-wins-crowdsourced-mounta (which is not at all the worst thing on *that* list). We don't want to have to try to defend against that.

In my experience, the quality of what's actually there isn't that great. One of my first tasks when I joined Discovery was to look at daily lists of top 100 zero-results queries that had been gathered automatically. I was excited by this same idea. The top 100 zero-results query list was a wasteland. (Minimal notes on some of what I found are here https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#Highly_repeated_searches.) We could make it better by focusing on human-ish searchers, using basic bot-exclusion techniques https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_frwiki_eswiki_itwiki_and_dewiki#Random_sampling, ignoring duplicates from the same IP, and such, but I don't think it would help. And while Wikipedia is not for children, there could be an annoying amount of explicit adult material on the list, too. We would probably find some interesting spellings of Facebook and WhatsApp, though.
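The filtering ideas in the paragraph above (crude bot exclusion and ignoring duplicates from the same IP) can be sketched roughly as follows. This is only an illustrative sketch, not the code Discovery used: the `(ip, query, hit_count)` row format, the function names, and the `max_per_ip` threshold are all assumptions made for the example.

```python
from collections import Counter


def normalize(query):
    """Lowercase and collapse whitespace so trivial variants count as one query."""
    return " ".join(query.lower().split())


def top_zero_result_queries(log_rows, n=100, max_per_ip=50):
    """Return the n most common zero-results queries.

    log_rows is an iterable of (ip, query, hit_count) tuples; this layout is
    hypothetical and does not reflect the real search log schema.
    """
    # Keep only searches that returned nothing.
    rows = [(ip, normalize(q)) for ip, q, hits in log_rows if hits == 0]

    # Crude bot exclusion (an assumption): drop IPs that issue an unusually
    # large number of zero-results queries.
    per_ip = Counter(ip for ip, _ in rows)

    # Building a set of (ip, query) pairs makes each IP count at most once
    # per query, which is the "ignore duplicates from the same IP" step.
    distinct = {(ip, q) for ip, q in rows if per_ip[ip] <= max_per_ip}

    counts = Counter(q for _, q in distinct)
    return counts.most_common(n)
```

Counting distinct IPs per query rather than raw query volume means a single client hammering one query cannot push it up the list on its own, though it does nothing against coordinated groups of people.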

If we're really excited about this, I could imagine using better techniques to pull zero-results queries and see if anything good is in there, but we'd have to commit to some sort of review before we publish it. For example, Discernatron https://discernatron.wmflabs.org/ data, after consulting with legal, is reviewed independently by two people, who then have to reconcile any discrepancies, before being made public. So I think we'd need an ongoing commitment to have at least two people under NDA who would review any list before publication. Reviewing 500-600 queries takes a couple of hours per person (we've done that for the Discernatron), so the top 100 would probably take less than an hour. I'd even be willing to help with the review (as I am for Discernatron) if we found there was something useful in there, but I'm not terribly hopeful. We'd also need more people to efficiently and effectively review queries for other languages if we wanted to extend this beyond English Wikipedia.

Finally, if this is important enough and the task gets prioritized, I'd be willing to dive back in and go through the process once and pull out the top zero-results queries, this time with basic bot exclusion and IP deduplication—which we didn't do early on because we didn't realize what a mess the data was. We could process a week or a month of data and categorize the top 100 to 500 results in terms of personal info, junk, porn, and whatever other categories we want or that bubble up from the data, and perhaps publish the non-personal-info part of the list as an example, either to persuade ourselves that this is worth pursuing, or as a clearer counter to future calls to do so. —Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation


discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery

Trey Jones

26 Jul 2016

1:58 a.m.

I decided to look into this as my 10% project last week. It ended up being a 15% project, but I wanted to finish it up.

I carefully reviewed and categorized the top 100 "unsuccessful" (i.e., zero-results) queries from May 2016, and skimmed the top 1,000 from May, and skimmed and compared the top 100 / 1,000 for June.

The top result (with several variants in the top 100) is a porn site that has had a wiki page created and deleted several times. Various websites round out the top 10. Internet personalities and websites dominate the top 100 and several have had pages created and deleted over the years. There's strong evidence of links being used for some queries, though I didn't try to track them down. There's plenty of personally identifiable information in the top 1,000 most frequent queries. More than 10% of the queries (by volume) get good results from the completion suggester or "did you mean" spelling suggestions, and more than 10% have some results approximately two months later (i.e., late last week).

Obvious refinements to the search strategy would eliminate so many high-frequency queries that any useful mining would be down to slogging through the low-impact long tail.

I don’t think there’s a lot here worth extracting, though others may disagree. The privacy concerns expressed earlier are genuine, and simple attempts to filter PII (using patterns, minimum IP counts, etc) are not guaranteed to be effective.
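To make the "patterns, minimum IP counts" idea concrete, here is a rough sketch of what such a simple filter might look like. The patterns, the `(ip, query)` row format, and the `min_ips` threshold are all hypothetical choices for illustration; as noted above, a filter like this is not guaranteed to catch PII (names, for example, pass straight through), so it could at most pre-screen a list ahead of human review.

```python
import re
from collections import defaultdict

# Illustrative patterns only (assumptions, far from exhaustive).
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email-like strings
    re.compile(r"(?:\d[\s-]?){7,}"),          # long digit runs, e.g. phone numbers
    re.compile(r"\b\d+\s+\w+\s+(?:st|street|ave|avenue|rd|road|blvd)\b", re.I),
]


def publishable_candidates(rows, min_ips=10):
    """Return queries seen from at least min_ips distinct IPs that match no pattern.

    rows is an iterable of (ip, query) pairs; this layout is hypothetical.
    """
    ips_by_query = defaultdict(set)
    for ip, query in rows:
        ips_by_query[query].add(ip)

    return sorted(
        query
        for query, ips in ips_by_query.items()
        if len(ips) >= min_ips
        and not any(p.search(query) for p in PII_PATTERNS)
    )
```

The minimum-IP threshold is meant to drop queries that only one or a few people typed, but a query can be shared by many IPs and still be about a private individual, which is one reason such filtering cannot stand in for review.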

For lots more details (but no actual queries), see here:

https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Sear...

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

