Hi Gerard,
I chatted with Trey (who did the analysis) to get his opinion on your concerns. Here is his response:
Hi Gerard,
I wasn't trying to pass judgement on notability when the search referred to a particular person, place, or thing, but I did take it as a sign of non-notability when a page had been created and then deleted for a particular person or website. Those items could become notable in the future, and any of them might be notable enough for Wikidata, but the original discussion seemed to be mainly about queries to English Wikipedia.
My conclusion, for English Wikipedia, is that there is not some gold mine of super high-frequency typos or new topics that we are missing out on. More importantly, there are real privacy concerns, and simple fixes (like requiring some number of unique IP addresses to have searched for something) are not enough.
I have looked at thousands of queries from about a dozen other language Wikipedias (some in more depth than others, and admittedly not usually sorted by frequency), but my intuition is the same as it was for English Wikipedia: not enough of value there to override privacy concerns. Automation is out for privacy reasons and manual review is not worth it, so this isn't a priority for Discovery right now.
I hope that helps to further explain what we found and why we're not acting further on this issue at this time.
Cheers,
Deb
-- Deb Tankersley Product Manager, Discovery IRC: debt Wikimedia Foundation
On Sat, Jul 30, 2016 at 1:30 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, So what do we have? A list of the most frequently missed searches for the English Wikipedia. Arguably the searches include content that is "iffy". But when many people seek info on a porn site, on what basis is it not notable? This is only for en.wp, and the results for other languages can be quite different. The problem with dismissing the need for this data in this way is that it supports the status quo for all Wikipedias. It does not suggest what we can do with a porn site. We could for instance have a Wikidata item stating that it is a porn site and leave it at that.
When you compare Wikidata with Wikipedia, Wikidata has significantly more data about whatever than Wikipedia does. All of these subjects are notable by Wikidata standards, and many are notable by English Wikipedia standards. Knowing what subjects are missed in Wikipedia and what people are looking for is important because they are the people Wikipedia misses.
NB thanks for the data, the project. Thanks, GerardM
On 29 July 2016 at 23:48, Deborah Tankersley dtankersley@wikimedia.org wrote:
Forwarding to the Wikimedia mailing list, I'm sorry for the lateness!
-- Deb Tankersley Product Manager, Discovery IRC: debt Wikimedia Foundation
---------- Forwarded message ----------
From: Trey Jones tjones@wikimedia.org
Date: Mon, Jul 25, 2016 at 11:58 AM
Subject: Re: [discovery] Fwd: [Wikimedia-l] Improving search (sort of)
To: A public mailing list about Wikimedia Search and Discovery projects <discovery@lists.wikimedia.org>
Cc: James Heilman jmh649@gmail.com
I decided to look into this as my 10% project last week. It ended up being a 15% project, but I wanted to finish it up.
I carefully reviewed and categorized the top 100 "unsuccessful" (i.e., zero-results) queries from May 2016, and skimmed the top 1,000 from May, and skimmed and compared the top 100 / 1,000 for June.
The top result (with several variants in the top 100) is a porn site that has had a wiki page created and deleted several times. Various websites round out the top 10. Internet personalities and websites dominate the top 100 and several have had pages created and deleted over the years. There's strong evidence of links being used for some queries, though I didn't try to track them down. There's plenty of personally identifiable information in the top 1,000 most frequent queries. More than 10% of the queries (by volume) get good results from the completion suggester or "did you mean" spelling suggestions, and more than 10% have some results approximately two months later (i.e., late last week).
Obvious refinements to the search strategy would eliminate so many high-frequency queries that any useful mining would be down to slogging through the low-impact long tail.
I don’t think there’s a lot here worth extracting, though others may disagree. The privacy concerns expressed earlier are genuine, and simple attempts to filter PII (using patterns, minimum IP counts, etc) are not guaranteed to be effective.
For lots more details (but no actual queries), see here:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Sear...
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Fri, Jul 15, 2016 at 11:31 AM, Trey Jones tjones@wikimedia.org wrote:
Finally, if this is important enough and the task gets prioritized, I'd be willing to dive back in and go through the process once and pull out the top zero-results queries, this time with basic bot exclusion and IP deduplication, which we didn't do early on because we didn't realize what a mess the data was. We could process a week or a month of data and categorize the top 100 to 500 results in terms of personal info, junk, porn, and whatever other categories we want or that bubble up from the data, and perhaps publish the non-personal-info part of the list as an example, either to persuade ourselves that this is worth pursuing, or as a
clearer counter to future calls to do so. —Trey
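For readers who want a concrete picture of the pass described above, here is a minimal sketch in Python. It is not the Discovery team's actual pipeline: the record layout (query, IP, user agent, result count), the `BOT_MARKERS` list, and the function names are all hypothetical, chosen only to illustrate basic bot exclusion, per-query IP deduplication, and an optional minimum-unique-IP threshold.

```python
from collections import defaultdict

# Hypothetical user-agent substrings; a real bot filter would need far more than this.
BOT_MARKERS = ("bot", "crawler", "spider")


def looks_like_bot(user_agent):
    """Rough bot check on the user-agent string (an illustrative assumption)."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def top_zero_result_queries(records, limit=100, min_unique_ips=1):
    """Rank zero-results queries by the number of distinct IPs that issued them.

    `records` is an iterable of (query, ip, user_agent, result_count) tuples,
    a made-up log layout used only for this example.
    """
    ips_by_query = defaultdict(set)
    for query, ip, user_agent, result_count in records:
        # Keep only zero-results queries that do not look automated.
        if result_count != 0 or looks_like_bot(user_agent):
            continue
        # Fold case and collapse whitespace so trivial variants count together.
        normalized = " ".join(query.lower().split())
        # A set per query means repeat searches from one IP count once.
        ips_by_query[normalized].add(ip)

    counts = [
        (query, len(ips))
        for query, ips in ips_by_query.items()
        if len(ips) >= min_unique_ips
    ]
    # Most distinct IPs first; ties broken alphabetically for stable output.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:limit]
```

Even with `min_unique_ips` raised, the output would still need manual review before anything is published: as noted elsewhere in this thread, pattern filters and minimum IP counts are not guaranteed to remove personal information.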
---------- Forwarded message ----------
From: "James Heilman" jmh649@gmail.com
Date: Jul 15, 2016 06:33
Subject: [Wikimedia-l] Improving search (sort of)
To: "Wikimedia Mailing List" wikimedia-l@lists.wikimedia.org
Cc:
A while ago I requested a list of the "most frequently searched for terms for which no Wikipedia articles are returned". This would allow the community to then create redirects or new pages as appropriate and help address the "zero results rate" of about 30%.
While we are still waiting for this data I have recently come across a list of the most frequently clicked on redlinks on En WP produced by Andrew West https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks
Many of these can be reasonably addressed with a redirect as the issue is often capitals.
Does anyone know where things are at with respect to producing the list of most searched-for terms that return nothing?
-- James Heilman MD, CCFP-EM, Wikipedian
discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery _______________________________________________ Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines New messages to: Wikimedia-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe