I can understand both caveats about the data still being too specific: cities > 100k, lat/long rounded to 0.5 degree. Of course any information (even western vs. eastern hemisphere) is at least of some help to a nefarious agent, so it's more a matter of how much exposure is deemed an acceptable risk.
Would cities > 500k be acceptable? Would rounding to 1 degree be acceptable? Even the latter is still useful for broader analyses, e.g. is the level of participation in rural areas of Russia or China in line with population density or not?
The rounding of degrees to any precision has nothing to do with determining state/region. That is done by MaxMind's algorithm, which still has the full precision available. Our concern is what will be stored on disk after the ip>geo lookup has been done.
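A rough sketch of that separation, with region resolved at full precision and only coarsened coordinates kept for storage. All names here (`coarsen`, `record_for_storage`, the `lookup` callable and its dictionary fields) are hypothetical stand-ins for illustration, not MaxMind's API or any existing WMF code:

```python
import math


def coarsen(value, step=1.0):
    """Snap a coordinate (in degrees) to the nearest multiple of `step`."""
    return math.floor(value / step + 0.5) * step


def record_for_storage(ip, lookup, step=1.0):
    """Resolve `ip` at full precision, then keep only coarse fields.

    `lookup` is any callable returning a dict with 'country', 'region',
    'lat' and 'lon'; it stands in for the real geolocation step.
    """
    full = lookup(ip)  # full precision exists only in memory
    return {
        "country": full["country"],
        "region": full["region"],  # decided before any rounding
        "lat": coarsen(full["lat"], step),
        "lon": coarsen(full["lon"], step),
    }


def fake_lookup(ip):
    # Hard-coded example result (downtown Washington, D.C.), not a real lookup.
    return {"country": "US", "region": "DC", "lat": 38.9072, "lon": -77.0369}


stored = record_for_storage("192.0.2.1", fake_lookup, step=1.0)
print(stored)
```

With `step=1.0` the stored coordinates become 39.0 and -77.0, while the region field still reflects the full-precision lookup.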
--
As for opt-in, I'm somewhat skeptical that it would give much credibility to the numbers. People in some countries (or even states) will opt in much more readily than in others. Or we would need to correct those figures, which is possible because we can measure the opt-in rate per region. Hmm, maybe, but complicated. I'm not sure we have the resources for this, unlike Google.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Wednesday, August 14, 2013 2:58 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] U.S. state-level editor retention data
(We would store it locally, as we do with the country and continent lookup lists, and could manually vet whether cities have more than, say, 100,000 people.)
I'm not sure that would always provide the safety we're looking for, because police work by a nefarious agent in a city of 100,000 people could quite easily lead to the identity of a specific editor.
As for latitude/longitude, again, these should be rounded on purpose.
If we round to 0.5 degree, this gives an east-west resolution of around 55 km (34 mi) at the equator and 22 km (14 mi) at the Arctic Circle; the north-south resolution stays at around 55 km everywhere.
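The arithmetic behind these figures can be checked with a short sketch. It assumes a spherical Earth with roughly 111.32 km per degree of latitude (an approximation), and the function name `cell_size_km` is hypothetical:

```python
import math

KM_PER_DEGREE = 111.32   # approx. length of one degree of latitude (assumed)
KM_PER_MILE = 1.609344
ARCTIC_CIRCLE_DEG = 66.56


def cell_size_km(step_deg, latitude_deg):
    """Return (north_south_km, east_west_km) of a grid cell at a latitude."""
    north_south = step_deg * KM_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for name, lat in [("equator", 0.0), ("Arctic Circle", ARCTIC_CIRCLE_DEG)]:
    ns, ew = cell_size_km(0.5, lat)
    print(f"{name}: north-south {ns:.1f} km, "
          f"east-west {ew:.1f} km ({ew / KM_PER_MILE:.1f} mi)")
```

The north-south extent of a cell is the same at every latitude; only the east-west extent narrows toward the poles, which is where the smaller figure for the Arctic Circle comes from.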
(Again, state or region lookup might be too costly anyway, but that is another matter.)
Unfortunately, I think 30 miles would not provide enough anonymity in China, because some 30-mile areas may contain only a few small villages. Also, unfortunately, 30 miles would not provide the accuracy that James needs to capture Washington, D.C. activity, because any log line would show up in Maryland, Virginia, and D.C. simultaneously.
I think we have to turn this request on its head a little bit and think about the people who are going to be potentially identified. We somehow have to get their permission to analyze this data. If you look at any other geo-analysis being performed by Apple, Google, etc. this is not unusual - they always ask permission from the end user being tracked. We could ask permission in the same way, but maybe find a way to be less creepy than the typical Google approach.