New subject: [discovery] Fwd: Improving search (sort of)

15 Jul 2016


Hey Trey,
Thanks for the in-depth discussion. So if the terms people are using that
result in "zero search results" are typically gibberish, why do we care if
30% of our searches result in "zero search results"? A big deal was made
about this a while ago.
If one were to look only at those search terms that more than 100 IPs
searched for, would that not remove the concerns about anonymity? One could
also limit the length of the searches displayed to 50 characters, and just
provide the first 100, with an initial human review to make sure we are not
missing anything.
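A minimal sketch of the filter proposed above, for illustration only. The
log format (tuples of IP, query, result count), the function name, and the
lowercasing/whitespace normalization are assumptions, not anything taken
from the actual search logs or tooling. The thresholds (more than 100 IPs,
50 characters, first 100) are the ones suggested in the paragraph above.

```python
from collections import defaultdict

MIN_DISTINCT_IPS = 100  # keep queries searched by MORE than this many IPs
MAX_QUERY_LENGTH = 50   # cap on the displayed length of each query
TOP_N = 100             # size of the list handed to human reviewers


def candidate_queries(log_rows, min_ips=MIN_DISTINCT_IPS,
                      max_len=MAX_QUERY_LENGTH, top_n=TOP_N):
    """Return (displayed_query, distinct_ip_count) pairs, most popular first.

    log_rows is a hypothetical iterable of (ip, query, result_count) tuples.
    """
    ips_by_query = defaultdict(set)
    for ip, query, result_count in log_rows:
        if result_count == 0:
            # Assumed normalization: trim whitespace and ignore case.
            ips_by_query[query.strip().lower()].add(ip)

    popular = [
        (query[:max_len], len(ips))
        for query, ips in ips_by_query.items()
        if len(ips) > min_ips
    ]
    # Sort by distinct-IP count (descending), then alphabetically for ties.
    popular.sort(key=lambda pair: (-pair[1], pair[0]))
    return popular[:top_n]
```

The output of such a filter would still be a candidate list for human
review, not something to publish directly.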
James
On Fri, Jul 15, 2016 at 9:31 AM, Trey Jones tjones@wikimedia.org wrote:
...
Pine, thanks for the forward. Regulars on the Discovery list may know me,
but James probably does not. I've manually reviewed tens of thousands of
generally poorly performing queries (fewer than 3 results) and skimmed
hundreds of thousands more from many of the top 20 Wikipedias—and to a
lesser extent other projects—over the year I've been at the WMF and in
Discovery. You can see my list of write-ups here
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes.
So I want to say that this is an awesome idea—which is why many people
have thought of it. It was apparently one of the first ideas the Discovery
department had when they formed (see Dan's notes linked below). It was also
one of the first ideas I had when I joined Discovery a few months later.
Dan Garry's notes on T8373
https://phabricator.wikimedia.org/T8373#1856036 and the following
discussion pretty much quash the idea of automated extraction and
publication from a privacy perspective. People not only divulge their own
personal information, they also divulge other people's personal
information. One example: some guy outside the U.S. was methodically
searching long lists of real addresses in Las Vegas. I will second Dan's
comments in the T8373 discussion; all kinds of personal data end up in
search queries. A dump of search queries *was* provided in September 2012
https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/,
but had to be withdrawn over privacy concerns.
Another concern for auto-published data: never underestimate the power of
random groups of bored people on the internet. 4chan decided to arrange
Time Magazine poll results
https://techcrunch.com/2009/04/27/time-magazine-throws-up-its-hands-as-it-gets-pwned-by-4chan/ so
the first letters spelled out a weird message. It would be easy for 4chan,
Reddit, and other communities to get any message they want on that list if
they happened to notice that it existed. See also Boaty McBoatface
https://en.wikipedia.org/wiki/RRS_Sir_David_Attenborough#Name and Mountain
Dew "Diabeetus"
https://storify.com/cbccommunity/hitler-did-nothing-wrong-wins-crowdsourced-mounta
(which is not at all the worst thing on *that* list). We don't want to
have to try to defend against that.
In my experience, the quality of what's actually there isn't that great.
One of my first tasks when I joined Discovery was to look at daily lists of
top 100 zero-results queries that had been gathered automatically. I was
excited by this same idea. The top 100 zero-results query list was a
wasteland. (Minimal notes on some of what I found are here
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#Highly_repeated_searches.)
We could make it better by focusing on human-ish searchers, using basic
bot-exclusion techniques
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_frwiki_eswiki_itwiki_and_dewiki#Random_sampling,
ignoring duplicates from the same IP, and such, but I don't think it would
help. And while Wikipedia is not for children, there could be an annoying
amount of explicit adult material on the list, too. We would probably find
some interesting spellings of Facebook and WhatsApp, though.
If we're really excited about this, I could imagine using better
techniques to pull zero-results queries and see if anything good is in
there, but we'd have to commit to some sort of review before we publish it.
For example, Discernatron https://discernatron.wmflabs.org/ data, after
consulting with legal, is reviewed independently by two people, who then
have to reconcile any discrepancies, before being made public. So I think
we'd need an ongoing commitment to have at least two people under NDA who
would review any list before publication. Reviewing 500-600 queries takes a
couple of hours per person (we’ve done that for the Discernatron), so the top 100
would probably be less than an hour. I'd even be willing to help with the
review (as I am for Discernatron) if we found there was something useful in
there—but I'm not terribly hopeful. We'd also need more people to
efficiently and effectively review queries for other languages if we wanted
to extend this beyond English Wikipedia.
Finally, if this is important enough and the task gets prioritized, I'd be
willing to dive back in and go through the process once and pull out the
top zero-results queries, this time with basic bot exclusion and IP
deduplication—which we didn't do early on because we didn't realize what a
mess the data was. We could process a week or a month of data and
categorize the top 100 to 500 results in terms of personal info, junk,
porn, and whatever other categories we want or that bubble up from the
data, and perhaps publish the non-personal-info part of the list as an
example, either to persuade ourselves that this is worth pursuing, or as a
clearer counter to future calls to do so.
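A rough sketch of the counting step described above, under stated
assumptions. The thread does not spell out what the "basic bot exclusion"
consists of, so the per-IP volume cutoff used here is a stand-in heuristic,
and the cutoff value, the log format, and the function name are all
hypothetical. IP deduplication is done by counting each (IP, query) pair
only once.

```python
from collections import Counter

MAX_QUERIES_PER_IP = 30  # assumed bot cutoff; not a figure from the thread


def top_zero_result_queries(log_rows, top_n=100,
                            max_per_ip=MAX_QUERIES_PER_IP):
    """Return the top_n (query, distinct_ip_count) pairs.

    log_rows is a hypothetical iterable of (ip, query, result_count) tuples
    covering the period of interest (for example a week or a month).
    """
    rows = [(ip, query.strip().lower())
            for ip, query, result_count in log_rows
            if result_count == 0]

    # Stand-in bot exclusion: drop IPs with too many zero-result queries.
    volume = Counter(ip for ip, _ in rows)

    # IP deduplication: a set keeps each (ip, query) pair only once.
    unique_pairs = {(ip, query) for ip, query in rows
                    if volume[ip] <= max_per_ip}

    counts = Counter(query for _, query in unique_pairs)
    # Explicit sort so that ties are ordered alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

The categorization into personal info, junk, porn, and so on would remain a
manual step applied to the returned list.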
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Fri, Jul 15, 2016 at 10:09 AM, Pine W wiki.pine@gmail.com wrote:
...
Forwarding
Pine
---------- Forwarded message ----------
From: "James Heilman" jmh649@gmail.com
Date: Jul 15, 2016 06:33
Subject: [Wikimedia-l] Improving search (sort of)
To: "Wikimedia Mailing List" wikimedia-l@lists.wikimedia.org
Cc:
A while ago I requested a list of the "most frequently searched for terms
for which no Wikipedia articles are returned". This would allow the
community to then create redirects or new pages as appropriate and help
address the "zero results rate" of about 30%.
While we are still waiting for this data, I have recently come across a
list of the most frequently clicked-on redlinks on En WP, produced by
Andrew West:
https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks
Many of these can be reasonably addressed with a redirect, as the issue is
often capitals.
Does anyone know where things are at with respect to producing the list of
most searched-for terms that return nothing?
--
James Heilman
MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
-- 
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
