(chiming in late on this thread)
James's original request would need to be better qualified before it can be answered correctly. We should have a separate conversation on what is and isn't acceptable in terms of releasing anonymized/aggregate pageview or edit activity geodata (in fact, we're working on guidelines for data publication with Legal as part of the Privacy Policy overhaul), but publishing aggregate stats on retention or activity for sub-state level cohorts is in principle possible without running into privacy-sensitive problems.
For example, if we were to compare country or state-level cohorts of registered users, I don't see major issues releasing aggregate metrics like the "median edit rate" or the "proportion of blocks" or the "24h activation rate", excluding cohorts with fewer than N users (where the data wouldn't be particularly useful) and without disclosing raw editor counts, which would allow individual user identification. I don't know if that's what James is asking for, but I'd be interested in knowing, for one, if specific regions have a higher than average rate of early activity among registered users on a per-project basis [1].
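To make the idea above concrete, here is a minimal sketch of that kind of report: aggregate metrics per region, small cohorts dropped, no raw counts emitted. It is only an illustration. The threshold value, the function name and the record keys ('edits_per_day', 'blocked') are assumptions for the example, not an existing WMF schema or tool.

```python
from statistics import median

# Hypothetical threshold N; no specific value has been agreed on.
MIN_COHORT_SIZE = 50


def publishable_metrics(cohorts, min_size=MIN_COHORT_SIZE):
    """Return aggregate metrics per region, omitting small cohorts.

    `cohorts` maps a region name to a list of per-user records. Each
    record is a dict with the assumed keys 'edits_per_day' (float) and
    'blocked' (bool). Raw editor counts are deliberately left out of
    the result.
    """
    report = {}
    for region, users in cohorts.items():
        # Skip cohorts with fewer than N users entirely.
        if len(users) < min_size:
            continue
        report[region] = {
            "median_edit_rate": median(u["edits_per_day"] for u in users),
            "proportion_blocked": sum(u["blocked"] for u in users) / len(users),
        }
    return report
```

A region below the threshold simply does not appear in the output, so its absence says nothing more than "fewer than N users".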
IP addresses for contributions by registered users are stored privately in the RecentChanges table. It's private data subject to our privacy policy, which means it is accessible to community members with CheckUser rights but also to WMF staff for analytics/operations.
Dario
[1] http://toolserver.org/~dartar/dashboards/metrics/threshold/
On Aug 14, 2013, at 6:55 AM, Erik Zachte ezachte@wikimedia.org wrote:
I can understand both caveats of data being still too specific: city > 100k, lat/long rounded 0.5 degree. Of course any information (even west vs east hemisphere) is at least of some help to a nefarious agent, so it's more a matter of how much exposure is deemed acceptable risk.
Would a city > 500 k be acceptable? Would rounding to 1 degree be acceptable? Even the latter is still useful for broader analyses, e.g. is the level of participation in rural areas of Russia or China comparable with population density or not?
The rounding of degrees to any precision has nothing to do with determining state/region. That would be MaxMind's algorithm, which still has the full precision available. Our concern is what will be stored on disk after ip>geo has been done.
--
As for opt-in, I'm somewhat skeptical that it would give much credibility to the numbers. People in some countries (or even states) will opt in much more readily than in others. Or we would need to correct those figures because we can measure opt-in rate per region. Hmm, maybe, but complicated. I'm not sure we have the resources for this, unlike Google.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Wednesday, August 14, 2013 2:58 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] U.S. state-level editor retention data
We would store it locally like we do with country and continent lookup list, and could manually vet whether cities are > say 100,000 people)
I'm not sure that would always provide the safety we're looking for, because police work by a nefarious agent in a city of 100,000 people could quite easily lead to the identity of a specific editor.
As for latitude/longitude, again, these should be rounded on purpose. If we round to 0.5 degree, this gives an east-west resolution of around 55 km or 30 mi at the equator, and 22 km or 12 mi at the Arctic Circle.
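The arithmetic behind those figures can be sketched as follows. This is a rough illustration under a spherical-Earth assumption (about 111.32 km per degree of arc); the function names are made up for the example and do not describe how any lookup is actually implemented. Note that only the east-west size of a cell shrinks toward the poles; the north-south size stays at roughly 55 km everywhere.

```python
import math

# Approximate length of one degree of arc on a spherical Earth (assumption).
KM_PER_DEGREE = 111.32


def round_coord(value, step=0.5):
    """Round a latitude or longitude to the nearest multiple of `step` degrees."""
    return round(value / step) * step


def cell_size_km(latitude_deg, step=0.5):
    """Return (north_south_km, east_west_km) for a `step`-degree cell.

    The north-south extent does not depend on latitude; the east-west
    extent scales with the cosine of the latitude.
    """
    north_south = step * KM_PER_DEGREE
    east_west = step * KM_PER_DEGREE * math.cos(math.radians(latitude_deg))
    return north_south, east_west


if __name__ == "__main__":
    # Equator, then roughly the Arctic Circle (about 66.56 degrees north).
    for lat in (0.0, 66.56):
        ns, ew = cell_size_km(lat)
        print(f"lat {lat:5.2f}: {ns:.1f} km north-south, {ew:.1f} km east-west")
```

Only the rounded values returned by something like `round_coord` would need to be kept on disk; the full-precision coordinates can be discarded once the lookup is done.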
(Again, state or region lookup might be too costly anyway, but that is another matter)
Unfortunately, I think 30 miles would not provide enough anonymity in China, because some 30-mile areas may contain only a few small villages. Nor would 30 miles provide the accuracy that James needs to capture Washington D.C. activity, because any log line would show up in Maryland, Virginia, and D.C. simultaneously.
I think we have to turn this request on its head a little bit and think about the people who are going to be potentially identified. We somehow have to get their permission to analyze this data. If you look at any other geo-analysis being performed by Apple, Google, etc. this is not unusual - they always ask permission from the end user being tracked. We could ask permission in the same way, but maybe find a way to be less creepy than the typical Google approach.