From: analytics-bounces(a)lists.wikimedia.org [mailto:analytics-
bounces(a)lists.wikimedia.org] On Behalf Of Yuvi Panda
Sent: Monday, August 25, 2014 2:22
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: [Analytics] Anonymizing and releasing 'edits per country' data for
I've been working for the last few days on
, which currently generates raw data
on 'number of non-bot edits per country', and I'd like to run some stats /
make some graphs based on it. Since I'd like all my 'research' to be
completely repeatable, I'd love it if we can make the 'raw data' (edits per
country) publicly available on labsdb. I have most of the code written for it,
*but* it needs anonymization.
The biggest de-anonymization threats involve identifying which editors come
from which countries, and can be executed in the following case:
An editor is the only person editing from a country in a project where the
country has low edit volume, and by a process of elimination / counting edits
from a public source (like recentchanges), the individual editor can be
connected to a particular country.
I propose the following Anonymization scheme:
1. No data for projects with fewer than a threshold number of total
*individual editors* in the time period for which the data is released.
2. For countries that have less than a threshold % of 'individual editors' in
the time period, we simply lump them in as 'other'.
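
A minimal sketch of the two rules above, not the actual code behind the tool. The input layout (one row per project and country carrying an editor count and an edit count), the function name `anonymize`, and both threshold parameters are assumptions made for illustration. It also assumes each editor is counted under exactly one country per project, so the per-country editor counts sum to the project total.

```python
from collections import defaultdict


def anonymize(rows, min_project_editors, min_country_share):
    """Apply the two proposed rules to per-country edit counts.

    rows: iterable of (project, country, editors, edits) tuples (assumed layout).
    min_project_editors: rule 1 threshold (hypothetical parameter).
    min_country_share: rule 2 threshold as a fraction, e.g. 0.05 for 5%
        (hypothetical parameter).
    Returns {project: {country_or_other: edits}}.
    """
    by_project = defaultdict(list)
    for project, country, editors, edits in rows:
        by_project[project].append((country, editors, edits))

    released = {}
    for project, entries in by_project.items():
        total_editors = sum(editors for _, editors, _ in entries)
        # Rule 1: release nothing for projects with too few individual editors.
        if total_editors == 0 or total_editors < min_project_editors:
            continue
        edits_out = defaultdict(int)
        for country, editors, edits in entries:
            # Rule 2: countries below the editor-share threshold become 'other'.
            share = editors / total_editors
            key = country if share >= min_country_share else "other"
            edits_out[key] += edits
        released[project] = dict(edits_out)
    return released


if __name__ == "__main__":
    sample = [
        ("enwiki", "US", 80, 1000),
        ("enwiki", "IN", 19, 300),
        ("enwiki", "TV", 1, 5),
        ("tinywiki", "TV", 2, 7),
    ]
    print(anonymize(sample, min_project_editors=10, min_country_share=0.05))
```

With the made-up sample rows, `tinywiki` is dropped entirely by rule 1 and the single-editor country in `enwiki` is folded into 'other' by rule 2. The threshold values here are placeholders; picking real ones is the open question.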
This removes most de-anonymization attacks I can think of. Thoughts? I can
easily write up the code to generate these on a monthly basis and puppetize
those to make the data publicly available. I think not just me, but lots of
external researchers would benefit from such data.
Yuvi Panda T