Would it relieve some of the concerns if we limited publishing of subnational data to particularly large countries, like the United States, and particularly large projects, like the English Wikipedia?
The size of the project is irrelevant. Even on wp:en it would be rather trivial to find the geo data for any very active editor, by matching timestamps in the squid log with timestamps in the dump or recent changes list. Of course we don't publish squid logs. But we should assess the risk for the case where data do leak or are otherwise exposed. In that case it is important that those geo data are *sufficiently non-specific*. For me that's the issue we should focus on.
--
The city names which MaxMind keeps track of are a limited list ( http://www.maxmind.com/GeoIPCity-534-Location.csv ). Of course it may expand.
We would store it locally, like we do with the country and continent lookup lists, and could manually vet whether cities have more than, say, 100,000 people.
So we could build a white list from it which expands over time. Of course that would be another lookup.
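To make the white list idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the population figures are rough made-up values rather than MaxMind data, and the names VETTED_POPULATION, build_whitelist and publishable_city are hypothetical.

```python
# Hypothetical, manually vetted populations keyed by (country, city).
# The numbers are illustrative only and do not come from MaxMind.
VETTED_POPULATION = {
    ("US", "Chicago"): 2_700_000,
    ("US", "Galena"): 3_400,
    ("NL", "Utrecht"): 320_000,
}

MIN_POPULATION = 100_000  # threshold suggested in the thread


def build_whitelist(populations, min_population=MIN_POPULATION):
    """Return the set of (country, city) keys above the population threshold."""
    return {key for key, pop in populations.items() if pop > min_population}


def publishable_city(country, city, whitelist):
    """Return the city name if it is white-listed, else None.

    None means: fall back to publishing the country only.
    """
    return city if (country, city) in whitelist else None


whitelist = build_whitelist(VETTED_POPULATION)
```

Any city not on the list, including cities MaxMind adds later, would stay hidden until someone vets it, which is the conservative default.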
As for latitude/longitude, again, these should be rounded on purpose.
If we round to 0.5 degree, this gives a north-south resolution of around 55 km (about 35 mi) everywhere, and an east-west resolution of around 55 km at the equator, shrinking to about 22 km (about 14 mi) at the Arctic Circle.
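A rough sketch of the rounding and the resulting cell size, in Python. It assumes a spherical earth with about 111.32 km per degree; the function names are hypothetical. The east-west width of a 0.5 degree cell shrinks with the cosine of the latitude, while the north-south height stays near 55 km.

```python
import math

# Approximate length of one degree of latitude (and of one degree of
# longitude at the equator), assuming a spherical earth.
KM_PER_DEGREE = 111.32


def round_coord(value, step=0.5):
    """Round a latitude or longitude to the nearest multiple of `step` degrees."""
    return round(value / step) * step


def cell_width_km(latitude_deg, step=0.5):
    """East-west width in km of a `step`-degree cell at the given latitude."""
    return step * KM_PER_DEGREE * math.cos(math.radians(latitude_deg))


def cell_height_km(step=0.5):
    """North-south height in km of a `step`-degree cell (same at every latitude)."""
    return step * KM_PER_DEGREE
```

For example, round_coord(52.37) gives 52.5 and round_coord(4.89) gives 5.0, so a location in Amsterdam would be published as (52.5, 5.0).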
(Again, state or region lookup might be too costly to do anyway, but that is another matter.)
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of James Hare
Sent: Wednesday, August 14, 2013 12:13 AM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] U.S. state-level editor retention data
On Aug 13, 2013, at 6:06 PM, Luis Villa lvilla@wikimedia.org wrote:
On Tue, Aug 13, 2013 at 1:45 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
And we already have some aggregated data about editors in the squid reports on stats.wikimedia.org, so it's surely not a privacy issue.
I'd be worried about using aggregation as a cure-all when, as others have pointed out, we have some very small wikis. But it can be done, especially when you check to make sure that (at whatever granularity you use for the geodata and timestamps) the resulting aggregated sets are always reasonably large.
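One way such a check could look, as a sketch in Python. The threshold of 10 editors, the record layout (editor_id, region, month) and the function name are all assumptions for illustration, not an existing WMF process.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 10  # assumed threshold; the right value is a policy decision


def aggregate_with_suppression(records, min_group_size=MIN_GROUP_SIZE):
    """Count distinct editors per (region, month) bucket.

    `records` is an iterable of (editor_id, region, month) tuples.
    Buckets with fewer than `min_group_size` distinct editors are dropped
    instead of being published.
    """
    editors = defaultdict(set)
    for editor_id, region, month in records:
        editors[(region, month)].add(editor_id)
    return {
        bucket: len(ids)
        for bucket, ids in editors.items()
        if len(ids) >= min_group_size
    }
```

With 12 editors in one region and 3 in another for the same month, only the first bucket would appear in the output.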
Luis
Would it relieve some of the concerns if we limited publishing of subnational data to particularly large countries, like the United States, and particularly large projects, like the English Wikipedia?
James