Hi folks,

Reviving an old thread (my apologies for the delay). I’ve looked over this thread, the talk page linked below, and a few other places that seemed like they might have feedback for us.

It seemed to me that key feedback, in addition to some technical suggestions, was:

As for DNT (Do Not Track), my main concern from the research perspective is whether interpreting DNT as exclusion from geo-aggregation would reduce the sample size excessively. Luis Villa’s link for Firefox numbers shows that, on desktop, DNT adoption peaked at 11% in March 2013 and declined to 8% by the end of the data in September 2014; on mobile, it peaked at 17% in July 2012 and similarly declined to 5% by September 2014. With numbers like these, I believe the larger sample (i.e., DNT hits included in geo-aggregation) would support somewhat more robust results, but the smaller sample (DNT hits excluded) is fine. I am somewhat worried about growth in DNT use, but as long as DNT is not enabled by default, that’s probably not a major concern.
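As a rough back-of-the-envelope check on the sample-size point, the sketch below estimates how much noisier a count-based estimate gets when a given share of hits is dropped. It assumes, hypothetically, that pageview counts behave like Poisson counts (relative standard error proportional to 1/sqrt(n)) and that DNT users are a random subset of readers; neither assumption comes from the proposal, and the function name is made up for illustration.

```python
import math


def relative_error_inflation(dnt_share: float) -> float:
    """Factor by which the relative standard error of a count grows when a
    fraction `dnt_share` of hits is excluded.

    Assumption (not from the proposal): Poisson-like counts, so the relative
    standard error scales as 1 / sqrt(n), and excluded hits are a random
    subset of all hits.
    """
    if not 0.0 <= dnt_share < 1.0:
        raise ValueError("dnt_share must be in [0, 1)")
    return 1.0 / math.sqrt(1.0 - dnt_share)


# DNT shares quoted above for Firefox (desktop and mobile, peak and latest).
for share in (0.05, 0.08, 0.11, 0.17):
    factor = relative_error_inflation(share)
    print(f"{share:.0%} of hits excluded -> relative error x{factor:.3f}")
```

Under these assumptions, excluding 11% of hits inflates the relative error by about 6%, and excluding 17% by about 10%. If DNT users differ systematically from other readers, the effect would be a bias rather than just extra noise, which this sketch does not capture.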

One thing that I would really like feedback on is: what is an acceptable k — i.e., how large is the set of users from whom a specific user is indistinguishable? I believe this will have a significantly greater impact on the quality of our results than DNT.
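To make the question about k concrete, here is a minimal sketch of one way a threshold could be applied. It assumes, hypothetically, that aggregation yields a count of distinct users per (article, region) cell and that cells with fewer than k users are simply suppressed. The cell layout, the function names, and the suppression rule are illustrative assumptions, not what the proposal specifies, and the counts are made up.

```python
from typing import Dict, Tuple

# Hypothetical cell key: (article title, region code).
Cell = Tuple[str, str]


def suppress_small_cells(user_counts: Dict[Cell, int], k: int) -> Dict[Cell, int]:
    """Keep only cells in which at least k distinct users were observed."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {cell: n for cell, n in user_counts.items() if n >= k}


def fraction_retained(user_counts: Dict[Cell, int], k: int) -> float:
    """Share of all observed users that survives suppression at threshold k."""
    total = sum(user_counts.values())
    if total == 0:
        return 0.0
    kept = suppress_small_cells(user_counts, k)
    return sum(kept.values()) / total


# Made-up example counts, for illustration only.
counts = {
    ("Influenza", "NM"): 40,
    ("Influenza", "WY"): 3,
    ("Dengue fever", "NM"): 7,
}

for k in (1, 5, 10):
    print(f"k={k}: {fraction_retained(counts, k):.0%} of users retained")
```

Running something like this over real cell counts for a range of k values would show how quickly small regions and low-traffic articles drop out, which is the trade-off the choice of k controls.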

Please let me know if I’ve missed anything. I’d like to rev the proposal soon, and I’d like to make it responsive to what the community thinks.


[Just to be absolutely clear, I’m speaking for myself, not my employer.]

On 13 January 2015 at 07:26, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los Alamos National Laboratory recently submitted to the Wikimedia Analytics Team aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview data dumps and making them available to the public and the research community. [1]

Reid and his team spearheaded the use of the public Wikipedia pageview dumps to monitor and forecast the spread of influenza and other diseases, using language as a proxy for location. [2] This proposal describes an aggregation strategy adding a geographical dimension to the existing dumps.

Feedback on the proposal is welcome on the lists or the project talk page on Meta. [3]


[1] https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
[2] http://dx.doi.org/10.1371/journal.pcbi.1003892
[3] https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews