Re: [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

13 May 2015

Hi folks,

Reviving an old thread (my apologies for the delay). I’ve looked over this thread, the
talk page linked below, and a few other places that seemed like they might have feedback
for us.

It seemed to me that key feedback, in addition to some technical suggestions, was:

  *   The ratio of logged-in to logged-out readers can be inferred.
  *   We should think more carefully about whether reading patterns can be inferred for
anonymous editors.
  *   How to interpret the Do-Not-Track header is controversial.

As for DNT, my main concern from the research perspective is whether interpreting DNT as
exclusion from geo-aggregation would reduce the sample size excessively. Luis Villa’s link
for Firefox numbers shows a peak of 11% in March 2013, declining to 8% at the end of the
data in September 2014, for the desktop version, and a 17% peak in July 2012 with a similar
decline to 5% in September 2014 for mobile users. With numbers like these, I believe the
larger sample (i.e., DNT hits included in geo-aggregation) would support somewhat more
robust results, but the smaller sample (DNT hits excluded) is fine. I worry somewhat about
growth in DNT use, but as long as DNT is not enabled by default, that’s probably not a
major concern.
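
To put rough numbers on the sample-size trade-off, here is a minimal sketch. It is not
part of the proposal, and it rests on assumptions: that the Firefox DNT shares quoted
above are representative of all readers, that DNT users are otherwise a random subset,
and that estimation error scales roughly as 1/sqrt(n). The function names are
hypothetical.

```python
import math

# DNT shares quoted above (Firefox users with DNT enabled).
dnt_shares = {
    "desktop peak (Mar 2013)": 0.11,
    "desktop (Sep 2014)": 0.08,
    "mobile peak (Jul 2012)": 0.17,
    "mobile (Sep 2014)": 0.05,
}


def retained_fraction(dnt_share):
    """Fraction of hits left after excluding DNT hits (assumes uniform DNT use)."""
    return 1.0 - dnt_share


def stderr_inflation(dnt_share):
    """Factor by which a 1/sqrt(n) standard error grows when DNT hits are dropped."""
    return 1.0 / math.sqrt(retained_fraction(dnt_share))


for label, share in dnt_shares.items():
    kept = retained_fraction(share)
    factor = stderr_inflation(share)
    print(f"{label}: keep {kept:.0%} of hits, std. error x{factor:.3f}")
```

Under these assumptions, even the 17% mobile peak inflates the standard error by less
than 10%, which is consistent with the smaller sample being acceptable.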

One thing that I would really like feedback on is: what is an acceptable k, i.e., how
large must the set of users be from whom a specific user is indistinguishable? I believe
this will have a significantly greater impact on the quality of our results than DNT.
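
To make the role of k concrete, here is a minimal sketch of threshold-based suppression.
It is an illustration only, not the mechanism in the proposal: the function name, the
`(user_id, region, page)` record layout, and the availability of any per-user identifier
are all assumptions.

```python
from collections import defaultdict


def k_anonymous_counts(hits, k):
    """Aggregate page views per (region, page) cell, keeping only cells
    that contain hits from at least k distinct users.

    hits: iterable of (user_id, region, page) tuples (hypothetical layout).
    Returns a dict mapping (region, page) to the total view count.
    """
    users = defaultdict(set)   # distinct users seen in each cell
    views = defaultdict(int)   # total hits in each cell
    for user_id, region, page in hits:
        cell = (region, page)
        users[cell].add(user_id)
        views[cell] += 1
    # Suppress every cell in which a user would be one of fewer than k.
    return {cell: n for cell, n in views.items() if len(users[cell]) >= k}


hits = [
    ("u1", "Region A", "Influenza"),
    ("u2", "Region A", "Influenza"),
    ("u3", "Region A", "Influenza"),
    ("u1", "Region B", "Influenza"),
]
print(k_anonymous_counts(hits, k=3))  # {('Region A', 'Influenza'): 3}
```

A larger k suppresses more small cells, which is why the choice of k trades geographic
granularity against privacy.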

Please let me know if I’ve missed anything. I’d like to rev the proposal soon, and I’d
like to make it responsive to what the community thinks.

Thanks,
Reid

[Just to be absolutely clear, I’m speaking for myself, not my employer.]

On 13 January 2015 at 07:26, Dario Taraborelli
<dtaraborelli@wikimedia.org> wrote:
...

I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos
National Laboratory recently submitted to the Wikimedia Analytics Team aimed at producing
privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them
available to the public and the research community. [1]

Reid and his team spearheaded the use of the public Wikipedia pageview dumps to monitor
and forecast the spread of influenza and other diseases, using language as a proxy for
location. [2] This proposal describes an aggregation strategy adding a geographical
dimension to the existing dumps.

Feedback on the proposal is welcome on the lists or on the project talk page on Meta. [3]

Dario

[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagev…
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_…
