Erik,
Thanks a lot for the appreciation.
As Sajjad mentioned, we have already obtained an edit-per-location dataset from Evan (Rosen) that has the following column structure:
*language,country,city,start,end,fraction,ts*
*start* and *end* denote the beginning and ending dates of the period over which edits are counted, and *ts* is the timestamp.
The *fraction*, however, gives a national ratio of edit activity: it is 'total edits from that city for that language Wikipedia project' divided by 'total edits from that country for that language Wikipedia project'. Hence, it cannot be used to understand global edit contributions to a Wikipedia project (for a time period).
It seems that the original data (from which this dataset was extracted) should also have the global fractions -- total edits from a city divided by total edits from the whole world, for a project, for a time period.
Would you know if the global fractions can also be derived from the XML dumps? Or, even better, is the relevant raw data available in CSV form somewhere else?
Bests,
sumandro
-------------
sumandro ajantriks.net
On Wednesday 15 May 2013 12:32 AM, analytics-request@lists.wikimedia.org wrote:
Send Analytics mailing list submissions to analytics@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit https://lists.wikimedia.org/mailman/listinfo/analytics or, via email, send a message with subject or body 'help' to analytics-request@lists.wikimedia.org
You can reach the person managing the list at analytics-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Analytics digest..."
Date: Tue, 14 May 2013 19:40:00 +0200
From: "Erik Zachte" ezachte@wikimedia.org
To: "'A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.'" analytics@lists.wikimedia.org
Subject: Re: [Analytics] Visualizing Indic Wikipedia projects.
Message-ID: 016f01ce50ca$0fe736b0$2fb5a410$@wikimedia.org
Content-Type: text/plain; charset="iso-8859-1"
Awesome work! I like the flexibility of the charts, easy to switch metrics and presentation mode.
- WMF has never captured ip->geo data at city level, but afaik this is going to change with Kraken.
- Total edits per article per year can be derived from the xml dumps. I may have some csv data that could come in handy.
For edit wars you need to track reverts on a per-article basis, right? That can also be derived from the dumps.
For a long history you need the full archive dumps and need to calculate a checksum per revision text. (Stub dumps have checksums, but only for the last year or two.)
Erik Zachte
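Erik's suggestion of tracking reverts through a checksum per revision text could be sketched roughly as follows. This is only an illustration, not anything from the thread: the function name `find_identity_reverts` is hypothetical, the use of SHA-1 is an assumption (any stable hash would serve), and it assumes the revision texts of one article have already been extracted from a full-history dump, oldest first. It only catches "identity" reverts, where a revision restores an earlier text exactly.

```python
import hashlib


def find_identity_reverts(revision_texts):
    """Return (reverting_index, restored_index) pairs for one article.

    revision_texts: the article's revision texts, oldest first
    (assumed to be already extracted from a full-history dump).
    """
    last_seen = {}  # checksum -> index of the latest revision with that text
    reverts = []
    for index, text in enumerate(revision_texts):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        previous = last_seen.get(digest)
        # A match with the immediately preceding revision is a null edit,
        # not a revert, so require at least one revision in between.
        if previous is not None and previous < index - 1:
            reverts.append((index, previous))
        last_seen[digest] = index
    return reverts


# Made-up history: revision 3 restores the text of revision 1.
history = ["A", "A B", "vandalism", "A B", "A B C"]
reverts = find_identity_reverts(history)
print(reverts)
```

Counting such pairs per article and per year would give a rough indicator of edit-war activity; partial reverts would need a text-diff approach instead.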
Hi sumandro, I've worked with this data generated by ERosen looking for ptwiki stats and I think I can help you.
For a given period you can get the total edits from a country and the total edits from all countries. From these you can compute the "country fraction" (country edits divided by worldwide edits); multiplying a city's fraction by its country's fraction then gives the city's "global fraction".
Best,
Henrique Andrade
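The calculation Henrique describes can be sketched as below. This is a minimal illustration under stated assumptions: the function name `global_city_fractions` and the shape of `country_edit_counts` are hypothetical, the per-country edit totals are assumed to be available from some other source for the same period, and all numbers are made up.

```python
from collections import defaultdict


def global_city_fractions(city_rows, country_edit_counts):
    """Convert national city fractions into global ones for a single period.

    city_rows: iterable of (language, country, city, fraction), where
        fraction = city edits / country edits (as in the dataset columns).
    country_edit_counts: {(language, country): total edits} for the same
        period (assumed to come from a separate source).
    """
    # Worldwide total per language project = sum over all countries.
    world_totals = defaultdict(int)
    for (language, _country), edits in country_edit_counts.items():
        world_totals[language] += edits

    result = {}
    for language, country, city, city_fraction in city_rows:
        country_fraction = (
            country_edit_counts[(language, country)] / world_totals[language]
        )
        # city/world = (city/country) * (country/world)
        result[(language, country, city)] = city_fraction * country_fraction
    return result


# Made-up example values, for illustration only.
rows = [
    ("ml", "India", "Kochi", 0.25),
    ("ml", "India", "Bangalore", 0.5),
    ("ml", "United Arab Emirates", "Dubai", 0.5),
]
counts = {("ml", "India"): 800, ("ml", "United Arab Emirates"): 200}
fractions = global_city_fractions(rows, counts)
print(fractions)
```

Note that this only works if the country totals cover the same period and the same language project as the city fractions.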
Hello everyone,
It's been a while since we left this discussion. Are there any updates regarding the edits by geography dataset? We want to look at edits/contributions to the Indic Wikipedia projects from 2008. It would be great if we can get hold of it to make some maps.
Thanks.
Sajjad.
Hello everyone,
You may remember that we were looking for edits by geography (city level) from 2008 for the Indic Wikipedia project (http://geohacker.github.io/indicwiki/). We were wondering if any of you would know where the dataset concerned can be accessed from.
Just to clarify, we are looking for the absolute number of monthly edits to each Wikipedia project from each location since 2008.
Thanks.
Cheers, Sajjad.
Hi Sajjad,
In the time since you last emailed, I think we've basically determined that providing this kind of data is contrary to our goals of protecting individual editors' identity. Information from the recent_changes stream can be combined with such geolocation data to pinpoint where certain editors live. We have put that project on hold to some extent, but are actively trying to figure out how to best aggregate and sanitize this data. While nothing is going to be available in the short term, the projects to look at / collaborate with are:
- Event Logging
- Wikimetrics
If you're interested in some very basic estimates, geolocating anonymous recent_changes by their IP could help you with that. I know you guys have done some work in this area already, but I thought it was worth mentioning for completeness.
-Toby
For that matter, surely this data won't exist anyway before 2013 or so? I'm not sure how long we retain IP data for logged-in users, but I'd be a bit startled if it was five years.
Andrew.
Andrew Gray, 13/03/2014 00:56:
For that matter, surely this data won't exist anyway before 2013 or so? I'm not sure how long we retain IP data for logged-in users, but I'd be a bit startled if it was five years.
EventLogging can contain almost anything I think. Is there any purging? I don't think so. Is it aggregate and anonymised? No longer. https://www.mediawiki.org/w/index.php?title=Extension:EventLogging&diff=prev&oldid=905171
Nemo
EventLogging can contain almost anything I think. Is there any purging? I don't think so. Is it aggregate and anonymised? No longer.
Sorry, but this is not correct: IP addresses are anonymized in Event Logging and always have been. We calculate an HMAC with a rotating salt that changes either every 90 days or with a service restart.
Event Logging data has never been aggregated; it is a system to log discrete events. There have not been any changes in this regard lately.
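The scheme Nuria describes (an HMAC keyed with a salt that rotates every 90 days or on service restart) could look roughly like the sketch below. This is not the actual Event Logging code: the class name `RotatingSaltHasher`, the choice of SHA-256, and the 32-byte salt are all assumptions made for illustration. Keeping the salt only in memory is what makes a restart produce a fresh one.

```python
import hashlib
import hmac
import secrets
import time

ROTATION_SECONDS = 90 * 24 * 60 * 60  # 90 days, as stated in the thread


class RotatingSaltHasher:
    """Hypothetical helper: HMAC of an IP address with an in-memory salt."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so rotation can be tested
        self._rotate()

    def _rotate(self):
        # The salt is never persisted, so a restart also yields a new one.
        self._salt = secrets.token_bytes(32)
        self._salt_created = self._clock()

    def anonymize(self, ip_address):
        if self._clock() - self._salt_created >= ROTATION_SECONDS:
            self._rotate()
        return hmac.new(
            self._salt, ip_address.encode("utf-8"), hashlib.sha256
        ).hexdigest()


# Usage with a fake clock; 192.0.2.x are documentation-only addresses.
now = [0.0]
hasher = RotatingSaltHasher(clock=lambda: now[0])
token = hasher.anonymize("192.0.2.1")
same_period = hasher.anonymize("192.0.2.1")
now[0] = ROTATION_SECONDS  # jump past the rotation interval
after_rotation = hasher.anonymize("192.0.2.1")
```

Within one salt period the same IP maps to the same token, so events can still be grouped; after rotation the old tokens can no longer be linked to new ones.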
On Thu, Mar 13, 2014 at 5:19 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Sorry, but this is not correct: IP addresses are anonymized in Event Logging and always have been. We calculate an HMAC with a rotating salt that changes either every 90 days or with a service restart.
Event Logging data has never been aggregated; it is a system to log discrete events. There have not been any changes in this regard lately.
What Nuria said is correct; however, we do currently store some data, such as User Agents. This is not our intention for the long term: we are in the middle of putting in place a sanitization strategy to get rid of any PII after 90 days. This discussion might make more sense in another thread though, so please do not hijack Sajjad's thread :)
Thanks for the quick response Toby.
On Thu, Mar 13, 2014 at 5:00 AM, Toby Negrin tnegrin@wikimedia.org wrote:
Hi Sajjad,
In the time since you last emailed, I think we've basically determined that providing this kind of data is contrary to our goals of protecting individual editors' identity. Information from the recent_changes stream can be combined with such geolocation data to pinpoint where certain editors live. We have put that project on hold to some extent, but are actively trying to figure out how to best aggregate and sanitize this data. While nothing is going to be available in the short term, the projects to look at / collaborate with are:
I see. We are not looking to have individual edit history with IP by location, but aggregated per city for each year. I believe that this aggregation could be anonymised? If that's not a possibility, can we get aggregated edits at the country level?
Thanks again.
Cheers, Sajjad.
I see. We are not looking to have individual edit history with IP by location, but aggregated per city for each year. I believe that this aggregation could be anonymised?
Unfortunately, what we mean is that, even aggregated, this data could still be used to identify people. This is because you could combine multiple data sources and infer a lot, especially from very active editors on smaller projects.
If that's not a possibility, can we get aggregated edits at the country level?
Actually, what Toby was describing above was at the country level. Even that is too sensitive to release broadly. But we are working out how to publish data that is "aggregated enough", which means determining the point at which the different streams of information we have cannot be combined to attack a person's identity. As a side note, city geolocation is only 80% accurate from what I last heard, and it's especially inaccurate in places like India. So if you do end up using anonymous IP geolocation, you should take a deeper look into that.