Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal - Wiki-research-l

Dario Taraborelli

13 Jan 2015

8:26 a.m.

I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team, aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community. [1]

Reid and his team spearheaded the use of the public Wikipedia pageview dumps to monitor and forecast the spread of influenza and other diseases, using language as a proxy for location. [2] This proposal describes an aggregation strategy adding a geographical dimension to the existing dumps.

Feedback on the proposal is welcome on the lists or the project talk page on Meta. [3]

Dario

[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagev…
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_…

Andrew Gray

13 Jan

9:03 p.m.

New subject: [Wiki-research-l] [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

Hi Dario, Reid,

This seems sensible enough and proposal #3 is clearly the better approach. An explicit opt-in/opt-out mechanism would not be worth the effort to build and would become yet another ignored preferences setting after a few weeks...

A couple of thoughts:

* I understand the reasoning for not using do-not-track headers (#4); however, it feels a bit odd to say "they probably don't mean us" and skip them... I can almost guarantee you'll have at least one person making a vocal fuss about not being able to opt out without an account. If we were to honour these headers, would it make a significant change to the amount of data available? Would it likely skew it any more than leaving off logged-in users?

* Option 3 does release one further piece of information over and above those listed - an approximate ratio of logged-in versus non-logged-in pageviews for a page. I cannot see any particular problem with doing this (and I can think of a couple of fun things to use it for) but it's probably worth being aware.

Andrew.

On 13 January 2015 at 07:26, Dario Taraborelli <dtaraborelli(a)wikimedia.org> wrote:

...

-- - Andrew Gray andrew.gray(a)dunelm.org.uk

Aaron Halfaker

9:24 p.m.

Andrew, I think it is reasonable to assume that the "Do not track" header isn't referring to this.

...

From http://donottrack.us/ with emphasis added.

...

Do Not Track is a technology and policy proposal that enables users to opt out of *tracking by websites they do not visit*, [...]

Do Not Track is explicitly for third-party tracking. We are merely proposing to count those people who do access our sites. Note that, in this case, we are not interested in obtaining identifiers at all, so the word "track" seems not to apply.

It seems like we're looking for something like a "Do Not Log Anything At All" header. I don't believe that such a thing exists -- but if it did I think it would be good if we supported it.

-Aaron

On Tue, Jan 13, 2015 at 2:03 PM, Andrew Gray <andrew.gray(a)dunelm.org.uk> wrote:

...

Andrew Gray

11:22 p.m.

Fair enough - I don't use it, and I think I'd got entirely the wrong end of the stick on what it's for! If it's intended to stop tracking by third-party sites then it certainly seems to be of little relevance here.

(It might be worth clarifying this in the proposal, in case a future ethics-committee reviewer gets the same misapprehension?)

Andrew.

On 13 January 2015 at 20:24, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:

...

-- - Andrew Gray andrew.gray(a)dunelm.org.uk

John Mark Vandenberg

14 Jan

2:46 a.m.

On Wed, Jan 14, 2015 at 9:22 AM, Andrew Gray <andrew.gray(a)dunelm.org.uk> wrote:

...

Oliver Keyes

4:25 a.m.

New subject: [Wiki-research-l] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

I'm confused; John, could you point to the element of the collected data that isn't collected already by default in any Nginx or Apache setup? I agree that there might be a lack of user expectation, but 'silently capturing behavioral data' seems a somewhat hyperbolic way to describe what's actually going on.

On Tuesday, 13 January 2015, John Mark Vandenberg <jayvdb(a)gmail.com> wrote:

...

On Wed, Jan 14, 2015 at 9:22 AM, Andrew Gray <andrew.gray(a)dunelm.org.uk> wrote:

I think you're right to be concerned about this. It is about expectations; people do not expect an NGO providing an encyclopedia to be silently capturing reading behaviour data. If the data is provided to other entities, even for noble research objectives, people expect "Do Not Track" to cover this.

https://cyberlaw.stanford.edu/node/6573

-- John Vandenberg

-- Sent from my mobile computing device of Lovecraftian complexity and horror.

John Mark Vandenberg

9:39 a.m.

On Wed, Jan 14, 2015 at 2:25 PM, Oliver Keyes <ironholds(a)gmail.com> wrote:

...

The proposed element to be added is geolocation below country level. Default Nginx and Apache log formats do not include geolocation, which is why this research proposal exists and is being discussed, and rightly so.

fwiw, the Nginx geoip module is not even included, by default, when compiling the source code.

As the paper explicitly describes, and as is a common theme in research proposals, Wikimedia access log information is user reading behaviour being captured. The old privacy and data retention policies gave users the expectation that access log data was destroyed after a set period, assumed to be only three months as that was the limit of Checkuser visibility. The current policies are more like "yes we collect a lot of data about users, using tracking technology, and please trust us." And "sorry we don't honour 'Don't track us', as we presumed that you trust us and the researchers that we allow to access our analytics."

We should be planning for what the effect will be when the WMF servers are hacked and _all_ of the analytics data is in the hands of a repressive government or similar. Or, imagine the WMF sends the analytics data across an insecure link which is tapped and the data reconstructed, either due to not using secure links at all, or an accidental routing problem.

https://lists.wikimedia.org/pipermail/wikimedia-l/2013-December/129357.html

If/when that day comes, hopefully they don't have much data to make inferences from, and what data they obtain can be well justified.

Having a quick peek, I thought it was odd that browsing Wikimedia sites now causes impressions to be sent back to the WMF servers with the country of the user included.

"This is a workaround to simplify analytics."
https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FCentralNotice/8ee87…

The more you collect, especially using multiple systems to collect similar data, the more likely that if subpoenaed, WMF's various datasets could be used to infer a pretty reliable answer to "which days in 2013 was John Vandenberg in Indonesia?", or "when did John Vandenberg first read the Wikipedia article about <bomb making ingredient>?"

The more you publish, even aggregated, the more likely these types of questions can be answered by inference without a subpoena, at least for users with large enough lists of public contributions, by scientists like yourself with lots of computation power and plenty of time on their hands rifling through the data to *infer* the identity of editors, and if it is a government body they also have lots of other datasets which can be used to assist in the task.

Adding fine-grained geolocation information to published page views is an example of the latter, and the paper wisely suggests not including logged in users as a possible solution to some of the privacy issues.

There is also the problem that many IPs can be easily inferred to be a single cohort of people in some situations, e.g. in regions where the only large collection of computers is a single facility such as a school. In a repressive regime especially, that could lead to official questions being asked like: why were so many students at this school reading about <blah> on <date>? And teachers being identified as responsible, etc.

The paper considers IP users vs logged in users to be a binary set. However there are tools built which exploit the fact that logged in users make a logged out edit which identifies their IP. Add geolocation of pageviews and we can infer the probability that other IPs in their smallest geolocation block are also likely to be edits by the same person, as the algorithm in the paper leaks 'number of active editors in each region each day'.

The purpose of this proposed change in analytics is summarised in the paper:

'In short, the current global aggregation of Wikipedia page view is unsuitable for an operational disease monitoring system. There will be no “Wikipedia Flu Trends” unless page view data are aggregated at a finer geographic scale.'

If "Wikipedia Flu Trends" is the justification, we had better be certain that detecting Flu Trends using Wikipedia is going to be the most effective method, and isn't just an academically interesting exercise. A limited trial to determine utility would be helpful to establish if "Wikipedia Flu Trends" is a viable world health solution worthy of justifying additional data retention and publishing of aggregates.

Is there a minimum threshold at which views of a page mean it becomes 'interesting' to analyse using finer grained geographic data? I suspect that pages with only hundreds of page views per day are not particularly useful for "Wikipedia Flu Trends". Also, does "Wikipedia Flu Trends" need to have access to geographically tagged page view data of, say, me reading http://en.wiktionary.org/wiki/bota today? Is there a way to restrict which types of pages are tracked at finer geographic granularity without adversely affecting the "Wikipedia Flu Trends" graph?

-- John Vandenberg

Oliver Keyes

4:54 p.m.

On Wed, Jan 14, 2015 at 3:39 AM, John Mark Vandenberg <jayvdb(a)gmail.com> wrote:

...

On Wed, Jan 14, 2015 at 2:25 PM, Oliver Keyes <ironholds(a)gmail.com> wrote:

Gotcha: I thought you were referring to the information we already have.

...

fwiw, the Nginx geoip module is not even included, by default, when compiling the source code. As the paper explicitly describes, and as is a common theme in research proposals, Wikimedia access log information is user reading behaviour being captured. The old privacy and data retention policies gave users the expectation that access log data was destroyed after a set period, assumed to be only three months as that was the limit of Checkuser visibility. The current policies are more like "yes we collect a lot of data about users, using tracking technology, and please trust us." And "sorry we don't honour 'Don't track us', as we presumed that you trust us and the researchers that we allow to access our analytics." We should be planning for what the effect will be when the WMF servers are hacked and _all_ of the analytics data is in the hands of a repressive government or similar. Or, imagine the WMF sends the analytics data across an insecure link which is tapped and the data reconstructed, either due to not using secure links at all, or an accidental routing problem. https://lists.wikimedia.org/pipermail/wikimedia-l/2013-December/129357.html

The geolocation proposal is to perform it over IP addresses...which are already stored. So, the only major difference between "hacking" now and "hacking" later is that doing it later means you don't have to spend 99 bucks on a geolocation hashtable.

...

If/when that day comes, hopefully they don't have much data to make inferences from, and what data they obtain can be well justified. Having a quick peek, I thought it was odd that browsing Wikimedia sites now causes impressions to be sent back to the WMF servers with the country of the user included. "This is a workaround to simplify analytics." https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FCentralNotice/8ee87…

CentralNotice and the fundraising banners have done this for absolutely years, yes; that's the code you're looking at.

...

The more you collect, especially using multiple systems to collect similar data, the more likely that if subpoenaed, WMF's various datasets could be used to infer a pretty reliable answer to "which days in 2013 was John Vandenberg in Indonesia?", or "when did John Vandenberg first read the Wikipedia article about <bomb making ingredient>?" The more you publish, even aggregated, the more likely these types of questions can be answered by inference without a subpoena, at least for users with large enough lists of public contributions, by scientists like yourself with lots of computation power and plenty of time on their hands rifling through the data to *infer* the identity of editors, and if it is a government body they also have lots of other datasets which can be used to assist in the task.

Yep, and that's why we're discussing this.

...

Adding fine-grained geolocation information to published page views is an example of the latter, and the paper wisely suggests not including logged in users as a possible solution to some of the privacy issues. There is also the problem that many IPs can be easily inferred to be a single cohort of people in some situations, e.g. in regions where the only large collection of computers is a single facility such as a school. In a repressive regime especially, that could lead to official questions being asked like: why were so many students at this school reading about <blah> on <date>? And teachers being identified as responsible, etc. The paper considers IP users vs logged in users to be a binary set. However there are tools built which exploit the fact that logged in users make a logged out edit which identifies their IP. Add geolocation of pageviews and we can infer the probability that other IPs in their smallest geolocation block are also likely to be edits by the same person, as the algorithm in the paper leaks 'number of active editors in each region each day'.

No, it doesn't: the proposal is to aggregate. Where there are few observations (or little variation in observations) within a geographic region, the data will be moved up one level and aggregated, and so on until a sufficient degree of fuzziness is reached. This is the very basis of k- and i-anonymity.
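To make the roll-up idea concrete, here is a minimal sketch in Python. It is not the algorithm from the proposal: the function name `roll_up`, the representation of a region as a path tuple such as `("US", "NM", "Los Alamos")`, and the single count threshold `k` are all assumptions made for illustration. It only shows the general shape of "move sparse counts up one level until the bucket is large enough".

```python
from collections import defaultdict


def roll_up(counts, k):
    """Merge page-view counts below k into the parent region, level by level.

    counts maps a region path, e.g. ("US", "NM", "Los Alamos"), to a view
    count for one page in one time window. The path layout and the single
    threshold k are illustrative assumptions, not the proposal's rules.
    """
    current = dict(counts)
    max_depth = max((len(path) for path in current), default=0)
    # Work from the finest level upward so that merged counts can keep
    # climbing until they land in a bucket that reaches the threshold.
    for depth in range(max_depth, 0, -1):
        merged = defaultdict(int)
        for path, views in current.items():
            if len(path) == depth and views < k:
                merged[path[:-1]] += views  # too few views: move up one level
            else:
                merged[path] += views
        current = dict(merged)
    return current


# Hypothetical counts for a single page on a single day.
example = {
    ("US", "NM", "Los Alamos"): 3,
    ("US", "NM", "Santa Fe"): 4,
    ("US", "NM", "Albuquerque"): 50,
    ("US", "CA", "Berkeley"): 2,
    ("ID",): 1,
}
result = roll_up(example, k=5)
```

With these made-up numbers, the two small New Mexico cities are reported together at state level, Albuquerque keeps its own city-level count, and the Berkeley and Indonesia views end up in the empty path `()`, i.e. the global bucket. That last bucket can itself still be below `k`, so a real release would have to decide whether to suppress it; the sketch leaves that choice to the caller.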

...

participants (5)

Aaron Halfaker
Andrew Gray
Dario Taraborelli
John Mark Vandenberg
Oliver Keyes