I understand both objections that the data would still be too specific:
cities > 100k, and lat/long rounded to 0.5 degree.
Of course any information (even west vs east hemisphere) is at least of some
help to a nefarious agent, so it's more a matter of how much exposure is
deemed acceptable risk.
Would a city > 500k be acceptable? Would rounding to 1 degree be
acceptable? Even the latter is still useful for broader analyses, e.g. is
the level of participation in rural areas of Russia or China comparable with
population density or not?
The rounding of degrees to any precision has nothing to do with determining
state/region. That would be done by MaxMind's algorithm, which still has the
full precision available. Our concern is what will be stored on disk after
the ip>geo lookup has been done.
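
To make the storage point concrete, here is a minimal sketch of coarsening
a coordinate before it is written to disk. The function name and the sample
coordinate are assumptions for illustration only; this is not MaxMind's API,
and the full-precision state/region lookup is assumed to have happened
already.

```python
def coarsen(lat, lon, step=0.5):
    """Snap a coordinate to the nearest multiple of `step` degrees.

    Hypothetical helper: only the coarsened pair would be stored.
    """
    return (round(lat / step) * step, round(lon / step) * step)


# Illustrative coordinate (roughly Washington, D.C.), not real log data.
full_precision = (38.9072, -77.0369)

print(coarsen(*full_precision, step=0.5))
print(coarsen(*full_precision, step=1.0))
```

A larger step widens the area a stored point can represent, which is the
trade-off between exposure and usefulness discussed above.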
--
As for opt-in, I'm somewhat skeptical that it would give much credibility to
the numbers. People in some countries (or even states) will opt in much more
readily than in others. Or we would need to correct those figures, which
might be possible since we can measure the opt-in rate per region. Hmm,
maybe, but complicated. I'm not sure we have the resources for this, unlike
Google.
Erik
From: analytics-bounces(a)lists.wikimedia.org
[mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dan Andreescu
Sent: Wednesday, August 14, 2013 2:58 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics.
Subject: Re: [Analytics] U.S. state-level editor retention data
We would store it locally, like we do with the country and continent lookup
lists, and could manually vet whether cities are > say 100,000 people.
I'm not sure that would always provide the safety we're looking for,
because police work by a nefarious agent in a city with 100,000 people
would quite easily lead to the identity of a specific editor.
As for latitude/longitude, again, these should be rounded on purpose.
If we round to 0.5 degree, this gives an east-west resolution of around 55
km or 30 mi at the equator, and 22 km or 12 mi at the Arctic Circle.
(Again, state or region lookup might be too costly to do anyway, but that
is another matter)
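
The figures quoted above can be roughly checked with a short sketch. It
assumes a spherical Earth with about 111.32 km per degree of arc and takes
the Arctic Circle at about 66.56 degrees north; the function name is
hypothetical. Only the east-west width of a grid cell shrinks with latitude;
the north-south height stays about the same.

```python
import math

# Assumption: spherical Earth, ~111.32 km per degree of arc.
KM_PER_DEGREE = 111.32


def cell_size_km(latitude_deg, step_deg=0.5):
    """Return (north_south_km, east_west_km) of a step_deg grid cell."""
    north_south = KM_PER_DEGREE * step_deg
    # East-west width scales with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for name, lat in [("equator", 0.0), ("arctic circle", 66.56)]:
    ns, ew = cell_size_km(lat)
    print(f"{name}: {ns:.1f} km north-south, {ew:.1f} km east-west")
```

Under these assumptions a 0.5 degree cell is about 56 km wide at the equator
and about 22 km wide at the Arctic Circle, in line with the numbers above.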
Unfortunately, I think 30 miles would not provide enough anonymity in China
because in some 30-mile areas there may only be a few small villages. Also
unfortunately 30 miles would not provide the accuracy that James needs to
capture Washington D.C. activity, because any log line would show up in
Maryland, Virginia, and D.C. simultaneously.
I think we have to turn this request on its head a little bit and think
about the people who are going to be potentially identified. We somehow
have to get their permission to analyze this data. If you look at any other
geo-analysis being performed by Apple, Google, etc. this is not unusual -
they always ask permission from the end user being tracked. We could ask
permission in the same way, but maybe find a way to be less creepy than the
typical Google approach.