Hello!
I've been working for the last few days on https://github.com/Ironholds/WPDMZ, which currently generates raw data on 'number of non-bot edits per country', and I'd like to run some stats / make some graphs based on it. Since I'd like all my 'research' to be completely repeatable, I'd love it if we can make the 'raw data' (edits per country) publicly available on labsdb. I have most of the code written for it, *but* it needs anonymization.
The biggest de-anonymization threats involve identifying which editors come from which countries, and can be executed in the following case:
An editor is the only person editing from a country in a project where the country has low edit volume. By a process of elimination / counting edits from a public source (like recentchanges), that individual editor can then be connected to a particular country.
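To make the threat concrete, here is a deliberately simplified sketch in Python. It assumes the released data is a per-country edit count for one project and one time period, and that per-editor edit counts for the same project and period can be tallied from a public source such as recentchanges. The function name and both input shapes are hypothetical, and the sketch only covers the single-editor case described above, not elimination across several editors.

```python
def linkable_countries(published_edits, public_edit_counts):
    """Find countries whose released edit total matches exactly one editor.

    published_edits: {country: edit_count} as it would appear in the release
                     (hypothetical shape).
    public_edit_counts: {editor: edit_count} tallied from a public source
                        (hypothetical shape).
    Returns {country: editor} for every unambiguous match.
    """
    matches = {}
    for country, edits in published_edits.items():
        # Editors whose public edit count equals the country's total.
        candidates = [
            editor for editor, count in public_edit_counts.items()
            if count == edits
        ]
        # A single candidate means the editor can be tied to the country.
        if len(candidates) == 1:
            matches[country] = candidates[0]
    return matches


published = {"US": 950, "VU": 7}
public = {"alice": 500, "bob": 450, "carol": 7}
print(linkable_countries(published, public))
```

With these made-up numbers only the low-volume country is linkable, which is the case the anonymization needs to rule out.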
I propose the following Anonymization scheme:
1. No data for projects with fewer than a threshold number of total *individual editors* in the time period for which the data is released.
2. For countries that have less than a threshold % of 'individual editors' in the time period, we simply lump them in as 'other'.
This removes most anonymization attacks I can think of. Thoughts? I can easily write up the code to generate these on a monthly basis and puppetize those to make the data publicly available. I think not just me, but lots of external researchers would benefit from such data.
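A minimal sketch of the two rules in Python, under stated assumptions: the function name, the input rows of (project, country, editors, edits), and both threshold parameters are hypothetical, and the project total is taken as the sum of the per-country editor counts. This is not the WPDMZ code, and the thread does not fix any threshold values.

```python
from collections import defaultdict


def anonymize(rows, min_project_editors, min_country_share):
    """Apply the two proposed rules to per-country rows.

    rows: iterable of (project, country, editors, edits) tuples
          (hypothetical input shape).
    min_project_editors: rule 1 threshold, an absolute editor count.
    min_country_share: rule 2 threshold, a fraction between 0 and 1.
    Returns {project: {country_or_other: edits}}.
    """
    by_project = defaultdict(list)
    for project, country, editors, edits in rows:
        by_project[project].append((country, editors, edits))

    released = {}
    for project, entries in by_project.items():
        total_editors = sum(editors for _, editors, _ in entries)
        # Rule 1: release nothing for projects with too few editors.
        if total_editors < min_project_editors:
            continue
        out = defaultdict(int)
        for country, editors, edits in entries:
            # Rule 2: fold low-share countries into a single 'other' bucket.
            share = editors / total_editors
            key = country if share >= min_country_share else "other"
            out[key] += edits
        released[project] = dict(out)
    return released


rows = [
    ("enwiki", "US", 80, 1000),
    ("enwiki", "IN", 19, 300),
    ("enwiki", "VU", 1, 5),
    ("tinywiki", "VU", 2, 7),
]
print(anonymize(rows, min_project_editors=10, min_country_share=0.05))
```

One thing the sketch makes visible: if only a single country falls below the share threshold, the 'other' bucket still describes just that country, so a real implementation would probably need an extra check for that case.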
Thanks!
Hey Yuvi,
this sounds like very interesting data to look at. Here are my thoughts:
- the Anonymization scheme sounds reasonable, and I'd like to hear from someone else @ wikimedia who has similar experience anonymizing data sets
- you were probably already thinking about it, but we need documentation too: a wikipage with the name of the table, data dictionary, etc... and even a blog post to announce the newly available data.
On Sun, Aug 24, 2014 at 5:21 PM, Yuvi Panda yuvipanda@gmail.com wrote:
-- Yuvi Panda T http://yuvi.in/blog
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Mon, Aug 25, 2014 at 5:41 PM, Kevin Leduc kevin@wikimedia.org wrote:
> Hey Yuvi,
> this sounds like very interesting data to look at. Here are my thoughts:
:D
> - the Anonymization scheme sounds reasonable, and I'd like to hear from someone else @ wikimedia who has similar experience anonymizing data sets
Glad to hear that!
> - you were probably already thinking about it, but we need documentation too: a wikipage with the name of the table, data dictionary, etc... and even a blog post to announce the newly available data.
Oh yeah, definitely. Will come once the code, etc is done :)
THIS IS SO USEFUL!
For grantmaking, this is the exact type of dataset we want to have publicly available. A lot of the initiatives we fund are at a country-based level, and our partners have a really hard time understanding the effects of the work they are doing on the aggregate language-wiki level. In addition to this edits-per-country data, it would be even more important for us to get the total number of editors / active editors by country as well. Kevin - it would be great to get an update from you on the timeline for this (in Q4 2014-15, it was punted to Q1 2014-15, but I haven't heard anything about it yet ...)
Thanks for starting this work, Yuvi! Jessie
On Mon, Aug 25, 2014 at 9:43 AM, Yuvi Panda yuvipanda@gmail.com wrote:
Yay to more people finding it useful :)
Editors / Active editors isn't too hard to add programmatically. The bigger problem is how to define 'editor from country' - one edit from that country? Does that mean that one editor can be considered to be from multiple countries? Do we double count mobile and desktop as separate?
An easy way to do this would be:
1. An 'editor from a country' is someone who has made at least one edit from that country
2. A 'desktop editor from a country' is someone who has made at least one edit from that country on desktop
3. A 'mobile editor from a country' is someone who has made at least one edit from that country on mobile
This muddles the data somewhat, since sum(editors_from_all_countries_for_a_project) != total_editors_for_project, and also sum(mobile_editors, desktop_editors) per country != total_editors per country. However, this is super simple to implement and also still useful, so I might end up doing that.
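A small Python sketch of the three definitions, to show where the double counting comes from. The function name and the (editor, country, platform) input rows are hypothetical, and the editors and edits in the example are made up.

```python
from collections import defaultdict


def editors_by_country(edits):
    """Count editors per country under the 'at least one edit' definitions.

    edits: iterable of (editor, country, platform) tuples, where platform
           is 'desktop' or 'mobile' (hypothetical input shape).
    Returns three {country: editor_count} dicts: all, desktop, mobile.
    """
    all_editors = defaultdict(set)
    per_platform = {"desktop": defaultdict(set), "mobile": defaultdict(set)}
    for editor, country, platform in edits:
        # One edit from a country is enough to count as an editor there.
        all_editors[country].add(editor)
        per_platform[platform][country].add(editor)

    def count(groups):
        return {country: len(members) for country, members in groups.items()}

    return (
        count(all_editors),
        count(per_platform["desktop"]),
        count(per_platform["mobile"]),
    )


edits = [
    ("alice", "IN", "desktop"),
    ("alice", "IN", "mobile"),
    ("alice", "US", "desktop"),
    ("bob", "IN", "mobile"),
]
total, desktop, mobile = editors_by_country(edits)
print(total, desktop, mobile)
```

Here there are only two distinct editors, but the per-country totals add up to three because alice is counted in both IN and US. Likewise desktop plus mobile for IN is three while the IN total is two, because alice edits from both platforms.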
Of course, assuming this entire thing gets OK'd fully by analytics :)
On Mon, Aug 25, 2014 at 6:14 PM, Jessie Wild jwild@wikimedia.org wrote:
-- Jessie Wild Sneller Grantmaking Learning & Evaluation Wikimedia Foundation
Imagine a world in which every single human being can freely share in the sum of all knowledge. Help us make it a reality! Donate to Wikimedia
I know the researchers have already put some thought around how to break down editors by country. I think this falls under metric standardization and we want to have a consistent way of counting editors who edit across several projects. There is also an expectation that as you aggregate numbers, editors are deduplicated (not counted twice).
I don't know if we are totally prepared to have this conversation yet. My priority list has discussions on how to break down editors by target site (desktop, mobile, API) first, then how to aggregate editors across projects.
On Mon, Aug 25, 2014 at 10:55 AM, Yuvi Panda yuvipanda@gmail.com wrote:
I would just like to second Jessie's enthusiasm: this would be helpful to have available for certain media requests, particularly for journalists from smaller countries that want to write profiles about the degree of engagement their fellow citizens have with Wikipedia.
Of course, we may find some disappointing stories in there as well, particularly underscoring the lack of editors from global south countries, so it will be important to deliver a good explanation for why anonymization means we have no useable data on certain nations. In any case, the results will be interesting.
On Mon, Aug 25, 2014 at 1:26 PM, Kevin Leduc kevin@wikimedia.org wrote:
Hi Folks -- sorry for the delay in responding.
While this data is awesome, we need to review the anonymization carefully. We once shared this data in dashboards and found some privacy issues so we needed to take it down.
We have an action item to review this issue with legal. I will follow up next week.
-Toby
On Mon, Aug 25, 2014 at 4:23 PM, Katherine Maher kmaher@wikimedia.org wrote:
-- Katherine Maher Chief Communications Officer Wikimedia Foundation 149 New Montgomery Street San Francisco, CA 94105
+1 (415) 839-6885 ext. 6635 +1 (415) 712 4873 kmaher@wikimedia.org
Hello,
I report this information to organizations already. I get data percentages from third parties, such as those shown at http://www.alexa.com/siteinfo/wikipedia.org, and multiply their percentages by other estimates that they share for various regions of a country. In this way, I provide estimates of Wikipedia traffic to a set of articles in a region of a country.
Of course this must have low accuracy, but it is a guess that I can make using the best data available to me. I would love to have access to anything more accurate.
I work in the health sector and I can confirm that a major reason why more health organizations do not contribute to Wikipedia is because convincing evidence does not exist to demonstrate that significant numbers of people use Wikipedia to seek health information. The bar they would want to see is quite low and it seems obvious to me that Wikipedia must be serving a larger audience than they are reaching in the other Internet outlets they use.
If you make a project page on wiki for your idea then please share.
yours,
On Fri, Aug 29, 2014 at 7:44 PM, Toby Negrin tnegrin@wikimedia.org wrote:
FWIW: depending on the threshold chosen in step 2 of the anonymization scheme suggested by Yuvi, some of the countries/languages will have no data. This data will solve the problem for some of the partners, but not all of them.
Hi Yuvi,
Maybe you can draw some inspiration for metadata from http://stats.wikimedia.org/wikimedia/squids/SquidReportPageEditsPerCountryOv...
Cheers, Erik
-----Original Message-----
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Yuvi Panda
Sent: Monday, August 25, 2014 2:22
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: [Analytics] Anonymizing and releasing 'edits per country' data for Wiki Projects