Re: [Analytics] Anonymizing and releasing 'edits per country' data for Wiki Projects

25 Aug 2014

Hey Yuvi,

this sounds like very interesting data to look at.  Here are my thoughts:

- the anonymization scheme sounds reasonable, but I'd like to hear from
someone else at Wikimedia who has experience anonymizing similar data sets

- you were probably already thinking about it, but we need documentation
too: a wiki page with the name of the table, a data dictionary, etc., and
ideally a blog post to announce the newly available data.

On Sun, Aug 24, 2014 at 5:21 PM, Yuvi Panda <yuvipanda(a)gmail.com> wrote:

  Hello!

 I've been working for the last few days on
 https://github.com/Ironholds/WPDMZ, which currently generates raw data
 on 'number of non-bot edits per country', and I'd like to run some
 stats / make some graphs based on it. Since I'd like all my
 'research' to be completely repeatable, I'd love it if we can make the
 'raw data' (edits per country) publicly available on labsdb. I have
 most of the code written for it, *but* it needs anonymization.

 The biggest de-anonymization threats involve identifying which editors
 come from which countries, and can be executed in the following case:

 An editor is the only person editing from a country in a project where
 the country has low edit volume, and by a process of elimination /
 counting edits from a public source (like recentchanges), the
 individual editor can be connected to a particular country.

 I propose the following Anonymization scheme:

 1. No data for projects with fewer than a threshold number of total
 *individual editors* in the time period for which the data is
 released.
 2. For countries that have less than a threshold % of 'individual
 editors' in the time period, we simply lump them in as 'other'.
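
Purely as an illustration, the two rules above might be sketched in
Python as follows. The function name, the (project, country, editor)
input shape, and both threshold values are assumptions made for this
example; the message does not specify them, and this is not the WPDMZ
code.

```python
from collections import defaultdict


def anonymize(edits, min_project_editors=50, min_country_share=0.05):
    """Apply the two proposed rules to raw edit records.

    edits: iterable of (project, country, editor) tuples, one per non-bot edit.
    Both thresholds are placeholder values, not numbers from the proposal.
    Returns {project: {country or 'other': edit count}}.
    """
    editors = defaultdict(lambda: defaultdict(set))  # project -> country -> editors
    counts = defaultdict(lambda: defaultdict(int))   # project -> country -> edits
    for project, country, editor in edits:
        editors[project][country].add(editor)
        counts[project][country] += 1

    released = {}
    for project, by_country in editors.items():
        total_editors = len(set().union(*by_country.values()))
        # Rule 1: release nothing for projects with too few individual editors.
        if total_editors < min_project_editors:
            continue
        out = defaultdict(int)
        for country, who in by_country.items():
            # Rule 2: fold countries with a small share of editors into 'other'.
            share = len(who) / total_editors
            key = country if share >= min_country_share else "other"
            out[key] += counts[project][country]
        released[project] = dict(out)
    return released
```

Note that the thresholds are checked against the number of distinct
editors, while the released figures are edit counts, which is how the
two rules are worded.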

 This removes most de-anonymization attacks I can think of. Thoughts? I
 can easily write up the code to generate these on a monthly basis and
 puppetize those to make the data publicly available. I think not just
 me, but lots of external researchers would benefit from such data.

 Thanks!

 --
 Yuvi Panda T
 http://yuvi.in/blog

 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics
