Erik and Dario,
Thanks for your response. You're right - there are privacy challenges, and
AOL's debacle is a case study in what one should NOT do, but that doesn't
mean it isn't possible. In an effort to move this conversation forward
concretely, I'll make a proposal:
For each article, print the k most frequent third-party search
queries (Google / Bing / etc). Do not include queries used fewer than t
times.
This seems benign to me for sufficiently small k and large t, but I'm
interested in hearing what others think. Perhaps some additional
constraints would make the dataset safer? For example, you could require
that each published query come from at least t unique IP addresses.
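A minimal sketch of what I have in mind (field names and log format are hypothetical; assume records of (article, query, ip) tuples extracted from the request logs):

```python
from collections import defaultdict

def top_queries(records, k=5, t=10):
    """For each article, return the k most frequent search queries,
    keeping only queries issued from at least t unique IP addresses.

    records: iterable of (article, query, ip) tuples.
    """
    unique_ips = defaultdict(set)   # (article, query) -> set of IPs
    counts = defaultdict(int)       # (article, query) -> raw frequency
    for article, query, ip in records:
        unique_ips[(article, query)].add(ip)
        counts[(article, query)] += 1

    per_article = defaultdict(list)
    for (article, query), n in counts.items():
        if len(unique_ips[(article, query)]) >= t:
            per_article[article].append((query, n))

    # Keep only the k most frequent surviving queries per article.
    return {a: sorted(qs, key=lambda x: -x[1])[:k]
            for a, qs in per_article.items()}
```

Note that a query repeated many times by a single IP is still suppressed here, which is exactly the kind of long-tail outlier (passwords, personal names) we'd want to drop.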
I understand that there are engineering challenges associated with this
dataset, but I think it's useful to figure out if it's even permissible
first.
-Shilad
On Mon, Dec 16, 2013 at 10:03 AM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Hi,
Christian is working on page view definitions mostly.
So I gladly defer to him for further comments.
One comment on anonymized search stats:
This has been tried before, also at WMF, and it was not a good idea.
Even with extensive filtering, some very privacy-sensitive data was still
exposed (think passwords).
That password data shouldn't have landed in the search box, but anything
that can be input wrongly, will be.
As for de-anonymization, and how disastrous it can be:
https://en.wikipedia.org/wiki/AOL_search_data_leak
*101 Dumbest Moments in Business
<http://money.cnn.com/galleries/2007/biz2/0701/gallery.101dumbest_2007/index.html>*
*57. AOL, Part 2*
In an "attempt to reach out to the academic community with new research
tools," AOL releases the search queries of 657,000 users.
Though AOL insists that the data contains no personally identifiable
information, the New York Times and other news outlets promptly identify a
number of specific users, including searcher No. 4417749,
soon-to-be-ex-AOL-subscriber Thelma Arnold of Lilburn, Ga., whose queries
include "womens underwear" and "dog that urinates on everything."
The gaffe leads to the resignation of AOL's chief technology officer and a
half-billion-dollar class-action lawsuit.
Erik Zachte
*From:* analytics-bounces(a)lists.wikimedia.org
[mailto:analytics-bounces(a)lists.wikimedia.org] *On Behalf Of *Dario Taraborelli
*Sent:* Saturday, December 14, 2013 1:04
*To:* A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics.
*Cc:* Brent Hecht
*Subject:* Re: [Analytics] Wikipedia dataset to support NLP disambiguation
Shilad, Edgard,
replying here to this and related threads.
1) Analytics Engineering (primarily Christian Aistleitner and Erik Zachte)
is currently working on defining how pageviews are counted and extracted
from the raw request logs. This work is part of a larger effort to replace
the legacy webstatscollector [1], which will produce more accurate data and
allow requests to be parsed in a more flexible way. Christian and
Erik should be able to comment on whether referral and search query string
extraction for inbound traffic are use cases falling within the initial
scope of this project. The publication of this data is a different matter
(see below).
2) Releasing anonymized internal search data is not AFAIK one of the
priorities the team is currently working on. As Andrew noted in a previous
thread, engineering effort aside, further releases of private data will be
subject to the new privacy policy which is currently being discussed on
Meta [2]. I don’t expect we’ll invest any effort into anonymizing or
aggregating data for the purpose of publication until the privacy policy
consultation is settled. Search is also undergoing a major overhaul [3].
3) As per Federico, a short description of all logs generated at Wikimedia
(including MediaWiki logs, page request logs, search logs and EventLogging
data) can be found on Wikitech [4].
Dario
[1]
https://wikitech.wikimedia.org/wiki/Analytics/Webstatscollector
[2]
https://meta.wikimedia.org/wiki/Privacy_policy
[3]
https://www.mediawiki.org/wiki/Search
[4]
https://wikitech.wikimedia.org/wiki/Logs
On Dec 10, 2013, at 10:09 PM, Shilad Sen <ssen(a)macalester.edu> wrote:
Greetings!
I'm a Professor at Macalester College in Minnesota, and I have been
collaborating with Brent Hecht and many students to develop a Java
framework for extracting multilingual knowledge from Wikipedia [1]. The
framework is pre-alpha now, but we hope to offer a stable release in the
next month.
Given a phrase (e.g. "apple"), our library must identify the articles
associated with it. This is a probabilistic question: how likely is the
phrase "apple" to refer to the article about the fruit vs. the company?
This simple task (often called Wikification or disambiguation) forms the
basis of many NLP algorithms.
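Concretely, the standard approach is to estimate P(article | phrase) from how often the phrase appears as anchor text for each article. A small sketch (the counts below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical anchor-text counts: (phrase, article) -> number of links
# where that phrase was the link text pointing at that article.
link_counts = {
    ("apple", "Apple_Inc."): 920,
    ("apple", "Apple"): 310,          # the fruit
    ("apple", "Apple_Records"): 45,
}

def disambiguate(phrase, counts):
    """Estimate P(article | phrase) from anchor-text link frequencies."""
    totals = defaultdict(int)
    for (p, article), n in counts.items():
        if p == phrase:
            totals[article] += n
    z = sum(totals.values())
    return {a: n / z for a, n in totals.items()} if z else {}
```

The dataset I'm proposing would let us build exactly this kind of table from search queries rather than link text, and across all language editions.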
Google and Stanford have released an awesome dataset to support this task
[2]. It contains the *text* of all internet hyperlinks to Wikipedia
articles. This dataset makes the problem much easier, but it has two
serious deficiencies. First, it only contains links to articles in English
Wikipedia. Second, it was generated once by Google, and it is unlikely
Google will update it.
The WMF could create a similar dataset by publishing the most common
inbound search queries for all WP pages across all language editions. This
dataset would enable individuals, researchers and small companies (not just
Google and Microsoft) to harness Wikipedia data for their applications.
Does this seem remotely possible? I've thought a little about engineering
and privacy issues related to the dataset. Neither are trivial, but I think
they are feasible, and I'd be happy to volunteer my engineering effort.
If you think the idea has legs, how do we develop a more formal proposal
about the dataset?
Thanks for your feedback!
-Shilad
[1]
https://github.com/shilad/wikAPIdia
[2]
http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.…
--
Shilad W. Sen
Assistant Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
ssen(a)macalester.edu
651-696-6273
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Shilad W. Sen
Assistant Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
ssen(a)macalester.edu
651-696-6273