Greetings,
I am James Hare, president of the Washington, DC chapter. At Wikimania I have been learning about the editor retention data the Wikimedia Foundation has been collecting and analyzing. I was discussing it with Ryan Kaldari and he noted that while the data was available at the national level, it was not yet available at the state level.
How difficult would it be to implement state-level analysis? Would it just be a matter of simply changing the geolocation lookup code, or would it be a very expensive change that would benefit relatively few people? For Wikimedia DC's sake I am interested in data for the District of Columbia, Maryland, Delaware, Virginia, and West Virginia (our defined chapter region).
Regards, James Hare
Hi James,
We can take a look at this -- the next step for WikiMetrics is to expand the reporting capabilities. The developer with the most context is out until Wednesday; we should be able to get back to you by the end of the week with an estimate of how difficult it would be to implement these changes.
Will that work?
-Toby
That will work. Cheers!
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Some thoughts on this:
We have been discussing adding new geo data for a long time.
I lost track of the current status and latest decisions, but FWIW, a year ago this was the idea for the squid log:
We thought of replacing the IP address with a composite field (using a different delimiter than the field delimiter).
The field could look like this:
4|hash code|CL||Santiago|-33.5,-70.5
6|hash code|US|CA|San Francisco|37.5,-122.5
Where 4 or 6 indicates the IP version (IPv4 or IPv6).
Hash code is the anonymized IP address.
Country code as used by MaxMind ( http://dev.maxmind.com/geoip/legacy/codes/iso3166/ )
Region/state when available or else empty string (*)
City name when available or else empty string ( http://www.maxmind.com/GeoIPCity-534-Location.csv )
Last come latitude/longitude, rounded on purpose. This gives a resolution of at best 55 km or 34 mi, depending on latitude, to ensure anonymization, particularly for edits. Otherwise a very active editor in a sparsely populated region of, say, China could easily be matched with edit timestamps from dumps.
* Caveat:
Supplying region code requires 'external lookup' as MaxMind puts it. ( http://www.maxmind.com/en/city )
This is probably a costly operation.
Erik
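As an illustration only, here is a minimal Python sketch of how a consumer of the log might split the composite field described above. The names `GeoField` and `parse_geo_field` are hypothetical, and the sketch assumes exactly six pipe-delimited parts with no `|` inside any value; it is not an existing WMF parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GeoField:
    """One parsed composite field (hypothetical structure, for illustration)."""
    ip_version: int          # leading marker, 4 or 6
    ip_hash: str             # anonymized IP address
    country: str             # country code, e.g. "CL"
    region: Optional[str]    # region/state, None when the part is empty
    city: Optional[str]      # city name, None when the part is empty
    lat: float               # rounded latitude
    lon: float               # rounded longitude


def parse_geo_field(raw: str) -> GeoField:
    """Split one pipe-delimited composite field into its six parts."""
    parts = raw.split("|")
    if len(parts) != 6:
        raise ValueError(f"expected 6 parts, got {len(parts)}: {raw!r}")
    version, ip_hash, country, region, city, coords = parts
    lat_text, lon_text = coords.split(",")
    return GeoField(
        ip_version=int(version),
        ip_hash=ip_hash,
        country=country,
        region=region or None,   # empty string becomes None
        city=city or None,
        lat=float(lat_text),
        lon=float(lon_text),
    )
```

For example, `parse_geo_field("4|hash code|CL||Santiago|-33.5,-70.5")` yields a record whose `region` is `None` and whose `city` is `"Santiago"`.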
Forgive me if I'm misunderstanding, but wouldn't a setup like this (even anonymized as described above) allow someone to recover the location of an individual editor on sparsely edited wikis?
If we're just looking to provide a convenient lookup for IP editors, what is the advantage of doing this over requiring researchers to use publicly available IP databases to perform geolocation?
Adam Hyland
Developer at Bocoup Web: http://shift command awesome.com
Yes, it would be easy to couple user names on a sparsely edited wiki to geo info (or, for that matter, the name of a very active editor on a very busy wiki).
So the point is that geo info should be inherently vague. As MaxMind only has city names for places with hundreds of thousands, if not millions, of inhabitants, that is not a giveaway. Likewise with rounded lat/long.
Erik
I don't think we should get too hung up on the specific format right now; I am really not sure whether a composite field is the best implementation, or at what level we want to geocode. But more importantly, I think that two issues get mixed up here: geocoding of readers and geocoding of editors.
It was my understanding that the original request pertained to geocoding of editors (if that's not the case, then my apologies in advance).
@James: can you confirm that we are talking about geocoding of editors?
D
That is correct. Also, if it helps, I don't necessarily need *city*-level information, just state. (For the purposes of this discussion, DC is a state since its stats would not be aggregated with any other state's.)
James
Hi James,
In general, we are very cautious about geocoding editors, particularly at a more granular level than the country level, and even more cautious when the data will be published. From a technical point of view, you could already do it for anonymous editors, as their IP addresses are published on the wiki itself and in the XML dump files. For logged-in editors we would have to rely on the RecentChanges table (see http://www.mediawiki.org/wiki/Manual:Recentchanges_table). However, data in this table is only accessible to users with the checkuser permission ( http://meta.wikimedia.org/wiki/CheckUser_policy#CheckUser_status). Hence, we cannot use this source to geocode editors. Even if the data were available from a source without such restrictions, we would still be bound by the WMF Privacy Policy and community expectations regarding the geocoding of IP addresses.
I am afraid that we have to reject this request, because we do not collect this data in a publicly available table and because publishing geocoded editor information would violate the WMF Privacy Policy and would not match community expectations regarding the geocoding of IP addresses.
Maybe we can continue this discussion to see if we can come up with alternative solutions to your problem?
Best, Diederik
Diederik,
Ah, I see where the confusion comes from.
My story is, as I said, about squid logs, where views and edits both coexist.
Your focus is on the recent changes list.
As for publishing, that is why it is important that geo data not pinpoint an exact location; see my earlier mail.
Also, I don't think James wants raw data; he wants aggregates based on these data. Right, James?
Erik
Indeed. Stuff like "the United States made 100,000 edits, of which 15,000 were made in Virginia," etc. I don't need (or want!) to know about individual data points. Just information already generally published about editing as a whole on the national level.
James
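A rough sketch of the kind of aggregate meant here, in Python. The input format (one `(country, state)` pair per edit) and the function names are assumptions made for illustration; nothing here reflects an existing WMF dataset or tool. Only counts leave the function, never individual rows.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Chapter region as listed in the original request.
CHAPTER_REGION = {"DC", "MD", "DE", "VA", "WV"}


def edits_by_state(rows: Iterable[Tuple[str, str]]) -> Counter:
    """Count edits per US state from (country, state) pairs.

    Rows from other countries, or with an empty state, are skipped.
    """
    return Counter(
        state for country, state in rows if country == "US" and state
    )


def region_total(counts: Counter, region: Set[str] = CHAPTER_REGION) -> int:
    """Sum the per-state counts over a set of states (missing states count as 0)."""
    return sum(counts[state] for state in region)
```

With such counts, a statement like "of N US edits, M were made in Virginia" is just `counts["VA"]` against `sum(counts.values())`.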
And we already have some aggregated data about editors in the squid reports on stats.wikimedia.org, so it's surely not a privacy issue.
Nemo
I'd be worried about using aggregation as a cure-all when, as others have pointed out, we have some very small wikis. But it can be done, especially if you check that (at whatever granularity you use for the geodata and timestamps) the resulting aggregated sets are always reasonably large.
Luis
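One way to read the "reasonably large" check above is as a minimum cell size: count distinct editors per group and suppress any group that falls below a threshold. The Python sketch below assumes rows of `(wiki, region, editor_id)`; the threshold value and all names are hypothetical, not an actual WMF policy or tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

MIN_EDITORS = 10  # hypothetical threshold, chosen only for illustration


def editors_per_cell(
    rows: Iterable[Tuple[str, str, str]],
    min_editors: int = MIN_EDITORS,
) -> Dict[Tuple[str, str], Optional[int]]:
    """Count distinct editors per (wiki, region) cell.

    Cells with fewer than `min_editors` distinct editors are reported
    as None instead of a number, so small groups are not exposed.
    """
    editors = defaultdict(set)
    for wiki, region, editor_id in rows:
        editors[(wiki, region)].add(editor_id)
    return {
        cell: (len(ids) if len(ids) >= min_editors else None)
        for cell, ids in editors.items()
    }
```

The same idea extends to time buckets: adding a coarse timestamp to the cell key makes the check apply at whatever granularity is published.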
Would it relieve some of the concerns if we limited publishing of subnational data to particularly large countries, like the United States, and particularly large projects, like the English Wikipedia?
James
-- Luis Villa Deputy General Counsel Wikimedia Foundation 415.839.6885 ext. 6810
The size of the project is irrelevant. Even on wp:en it would be rather trivial to find the geo data for any very active editor, by matching timestamps in the squid log with timestamps in the dump or recent changes list. Of course we don't publish squid logs. But let us assess risk when data do leak or are exposed otherwise. Then it is important those geo data are *sufficiently non-specific*. For me that's the issue we should focus on.
--
The list of city names that MaxMind keeps track of is limited ( http://www.maxmind.com/GeoIPCity-534-Location.csv ). Of course it may expand.
We would store it locally, as we do with the country and continent lookup lists, and could manually vet whether cities have more than, say, 100,000 people.
So we could build a white list from it which expands over time. Of course that would be another lookup.
As for latitude/longitude, again, these should be rounded on purpose.
If we round on 0.5 degree, this gives a latitudinal resolution of around 55 km or 30 mi at the equator, and 22 km or 12 mile at the arctic circle.
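For reference, a small Python sketch of the rounding and the resulting cell size. It assumes a spherical Earth with roughly 111.32 km per degree; the function names are hypothetical. Note that the figures quoted above correspond to the east-west width of a 0.5 degree cell, which shrinks with the cosine of the latitude, while the north-south height stays at about 55 km everywhere.

```python
import math

# Approximate length of one degree of latitude (and of longitude at the equator).
KM_PER_DEGREE = 111.32


def round_to_step(value, step=0.5):
    """Snap a coordinate to the nearest multiple of `step` degrees."""
    return round(value / step) * step


def cell_size_km(latitude, step=0.5):
    """Return (north_south_km, east_west_km) for a grid cell at the given latitude."""
    north_south = step * KM_PER_DEGREE
    east_west = step * KM_PER_DEGREE * math.cos(math.radians(latitude))
    return north_south, east_west


if __name__ == "__main__":
    print(round_to_step(-33.45), round_to_step(-70.67))
    for latitude in (0.0, 38.9, 66.56):  # equator, Washington DC, Arctic Circle
        print(latitude, cell_size_km(latitude))
```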
(Again state or region lookup might be too costly to lookup anyway, but that is another matter)
Erik
We would store it locally like we do with country and continent lookup list, and could manually vet whether cities are > say 100,000 people)
I'm not sure that would always provide the safety we're looking for. Because police work, by a nefarious agent in a city with 100,000 people, would quite easily lead to the identity of a specific editor.
As for latitude/longitude, again, these should be rounded on purpose.
If we round on 0.5 degree, this gives a latitudinal resolution of around 55 km or 30 mi at the equator, and 22 km or 12 mile at the arctic circle.
(Again state or region lookup might be too costly to lookup anyway, but that is another matter)
Unfortunately, I think 30 miles would not provide enough anonymity in China because in some 30 mile areas there may only be a few small villages. Also unfortunately 30 miles would not provide the accuracy that James needs to capture Washington D.C. activity, because any log line would show up in Maryland, Virginia, and D.C. simultaneously.
I think we have to turn this request on its head a little bit and think about the people who are going to be potentially identified. We somehow have to get their permission to analyze this data. If you look at any other geo-analysis being performed by Apple, Google, etc. this is not unusual - they always ask permission from the end user being tracked. We could ask permission in the same way, but maybe find a way to be less creepy than the typical Google approach.
I can understand both caveats about the data still being too specific: city > 100k, lat/long rounded to 0.5 degree. Of course any information (even western vs. eastern hemisphere) is at least of some help to a nefarious agent, so it's more a matter of how much exposure is deemed acceptable risk.
Would a city > 500k be acceptable? Would rounding to 1 degree be acceptable? Even the latter is still useful for broader analyses, e.g. is the level of participation in rural areas of Russia or China comparable with population density or not?
The rounding of degrees to any precision has nothing to do with determining state/region. That would be MaxMind's algorithm, which still has the full precision available. Our concern is what will be stored on disk after ip>geo has been done.
--
As for opt-in, I'm somewhat skeptical that it would give much credibility to the numbers. People in some countries (or even states) will opt in much more readily than in others. Or we would need to correct those figures, since we can measure the opt-in rate per region. Hmm, maybe, but complicated. I'm not sure we have the resources for this, unlike Google.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu Sent: Wednesday, August 14, 2013 2:58 PM To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] U.S. state-level editor retention data
We would store it locally like we do with country and continent lookup list, and could manually vet whether cities are > say 100,000 people)
I'm not sure that would always provide the safety we're looking for. Because police work, by a nefarious agent in a city with 100,000 people, would quite easily lead to the identity of a specific editor.
As for latitude/longitude, again, these should be rounded on purpose.
If we round on 0.5 degree, this gives a latitudinal resolution of around 55 km or 30 mi at the equator, and 22 km or 12 mile at the arctic circle.
(Again state or region lookup might be too costly to lookup anyway, but that is another matter)
Unfortunately, I think 30 miles would not provide enough anonymity in China because in some 30 mile areas there may only be a few small villages. Also unfortunately 30 miles would not provide the accuracy that James needs to capture Washington D.C. activity, because any log line would show up in Maryland, Virginia, and D.C. simultaneously.
I think we have to turn this request on its head a little bit and think about the people who are going to be potentially identified. We somehow have to get their permission to analyze this data. If you look at any other geo-analysis being performed by Apple, Google, etc. this is not unusual - they always ask permission from the end user being tracked. We could ask permission in the same way, but maybe find a way to be less creepy than the typical Google approach.
(chiming in late on this thread)
James's original request would need to be better qualified in order to be answered correctly. We should have a separate conversation on what's acceptable and what isn't in terms of releasing anonymized/aggregate pageview or edit activity geodata (in fact, we're working on guidelines for data publication with Legal as part of the Privacy Policy overhaul), but publishing aggregate stats on retention or activity for sub-state level cohorts is in principle possible without running into privacy-sensitive problems.
For example, if we were to compare country or state-level cohorts of registered users, I don't see major issues releasing aggregate metrics like the "median edit rate" or the "proportion of blocks" or the "24h activation rate", excluding cohorts with fewer than N users (where the data wouldn't be particularly useful) and without disclosing raw editor counts, which would allow individual user identification. I don't know if that's what James is asking for, but I'd be interested in knowing, for one, if specific regions have a higher-than-average rate of early activity among registered users on a per-project basis [1]
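A minimal Python sketch of that kind of release, assuming one (region, edit count) record per registered user. The function name `cohort_medians` and the default threshold are hypothetical; the point is only that small cohorts are dropped and raw editor counts never appear in the output.

```python
from collections import defaultdict
from statistics import median


def cohort_medians(records, min_cohort_size=50):
    """Return {region: median edit count}, omitting cohorts below the size threshold.

    records: iterable of (region, edit_count) pairs, one per registered user (assumed layout).
    """
    counts_by_region = defaultdict(list)
    for region, edit_count in records:
        counts_by_region[region].append(edit_count)
    # Only the aggregate is returned; the cohort size itself is not disclosed.
    return {
        region: median(counts)
        for region, counts in counts_by_region.items()
        if len(counts) >= min_cohort_size
    }


if __name__ == "__main__":
    sample = [("DC", 3), ("DC", 5), ("DC", 7), ("WV", 100)]
    print(cohort_medians(sample, min_cohort_size=3))
```

Choosing N is a policy decision rather than a coding one; the sketch only shows where the threshold would be applied.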
IP addresses for contributions by registered users are stored privately in the RecentChanges table. It's private data subject to our privacy policy, which means it is accessible to community members with CheckUser rights but also to WMF staff for analytics/operations.
Dario
[1] http://toolserver.org/~dartar/dashboards/metrics/threshold/
Diederik:
I think that two issues get mixed up here: geocoding of readers and geocoding of editors.
Not sure why you say that. I mention editors as well, not readers.
I don't think we should get too hung up on the specific format right now, I am really not sure if a composite field is the best implementation and at what level we want to geocode.
That was the format we more or less settled on July 2012. I am just reiterating. Of course this is not cast in stone.
The idea then was to provide all these subfields, if available, on all traffic. That is, if performance allows; again, performance is more of an issue for state, with its irregular boundaries, than for city, where MaxMind probably just does simple arithmetic, calculating the distance from the nearest city center.
As for which data to provide: other people will want to see data broken down differently. We got requests to analyze data from India and compare major cities. By doing geo-IP on all traffic and providing all the geo data we can supply efficiently, we do this just once for all stakeholders.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of James Hare Sent: Tuesday, August 13, 2013 4:54 PM To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Cc: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] U.S. state-level editor retention data
On Aug 13, 2013, at 9:23 AM, Diederik van Liere dvanliere@wikimedia.org wrote:
On Mon, Aug 12, 2013 at 6:46 PM, Erik Zachte ezachte@wikimedia.org wrote:
Some thoughts on this:
We have been discussing adding new geo data for a long time.
I lost track of current status and latest decisions but FWIW a year ago this was the idea for squid log:
We thought of replacing ip address by a composite field (using a different delimiter than the field delimiter).
The field could look like this:
4|hash code|CL||Santiago|-33.5,-70.5
6|hash code|US|CA|San Francisco|37.5,-122.5
Where 4 or 6 is the #triplets in ip address.
Hash code is the anonymized IP address.
Country code as used by MaxMind ( http://dev.maxmind.com/geoip/legacy/codes/iso3166/ )
Region/state when available or else empty string (*)
City name when available or else empty string ( http://www.maxmind.com/GeoIPCity-534-Location.csv )
Lastly follow latitude/longitude, rounded on purpose. This gives a resolution of at best 55 km or 30 mi, depending on latitude, to ensure anonymization, particularly for edits. Otherwise a very active editor in a sparsely populated region of, say, China could easily be matched with edit timestamps from dumps.
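To make the field layout above concrete, here is a hedged Python sketch that builds and parses such a composite field. Several details are assumptions for illustration only: the leading 4 or 6 is taken to be the IP version, the hash is a truncated salted SHA-256, and the names `build_geo_field` and `parse_geo_field` are hypothetical. The geo values are passed in by the caller, so no MaxMind API is used or implied.

```python
import hashlib
import ipaddress


def build_geo_field(ip, salt, country, region="", city="", lat=None, lon=None, step=0.5):
    """Build 'version|hash|country|region|city|lat,lon' with coordinates rounded to `step`."""
    address = ipaddress.ip_address(ip)
    # Salted hash stands in for the "hash code"; the real scheme is not specified in the thread.
    digest = hashlib.sha256(salt + address.packed).hexdigest()[:16]
    coords = ""
    if lat is not None and lon is not None:
        coords = f"{round(lat / step) * step},{round(lon / step) * step}"
    return "|".join([str(address.version), digest, country, region, city, coords])


def parse_geo_field(field):
    """Split a composite field back into a dict; empty subfields stay empty strings."""
    version, digest, country, region, city, coords = field.split("|")
    lat, lon = (float(part) for part in coords.split(",")) if coords else (None, None)
    return {
        "ip_version": int(version),
        "hash": digest,
        "country": country,
        "region": region,
        "city": city,
        "lat": lat,
        "lon": lon,
    }


if __name__ == "__main__":
    field = build_geo_field("200.1.2.3", b"example-salt", "CL", city="Santiago",
                            lat=-33.45, lon=-70.67)
    print(field)
    print(parse_geo_field(field))
```

The sketch assumes that no subfield contains the "|" delimiter. A salted hash of an IPv4 address is also only weak anonymization if the salt ever leaks, so the hashing scheme would need its own review.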
I don't think we should get too hung up on the specific format right now, I am really not sure if a composite field is the best implementation and at what level we want to geocode. But more importantly, I think that two issues get mixed up here: geocoding of readers and geocoding of editors.
It was my understanding that the original request pertained to geocoding of editors (if that's not the case then my advance apologies).
@James: can you confirm that we are talking about geocoding of editors?
D
That is correct. Also, if it helps, I don't necessarily need *city*-level information, just state. (For the purposes of this discussion, DC is a state since its stats would not be aggregated with any other state's.)
James
* Caveat:
Supplying region code requires 'external lookup' as MaxMind puts it. ( http://www.maxmind.com/en/city )
This is probably a costly operation.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of James Hare Sent: Sunday, August 11, 2013 1:55 PM To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. Subject: Re: [Analytics] U.S. state-level editor retention data
That will work. Cheers!
On Aug 10, 2013, at 9:21 AM, Toby Negrin wrote:
Hi James,
We can take a look at this -- the next step for WikiMetrics is to expand the reporting capabilities. The developer with the most context is out until Wednesday; we should be able to get back to you by the end of the week with an estimate of how difficult it would be to implement this change.
Will that work?
-Toby