> We need to anonymize both IP addresses and proxy information with a secure
> hash if we want to keep each GET request's geolocation, to be compliant
> with the Privacy Policy.

Maybe this is not clear: raw IPs are not kept once geolocation is done.
The IPs are discarded, and the geolocation info is what is kept long term.

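A minimal sketch of that geolocate-then-discard flow, for illustration only. The prefix table `GEO_TABLE`, the helper names, and the record fields are all hypothetical; a real pipeline would query a geolocation database and use its own schema.

```python
import ipaddress

# Hypothetical prefix-to-country table standing in for a geolocation database.
# The prefixes are documentation ranges (RFC 5737), not real assignments.
GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "US",
    ipaddress.ip_network("198.51.100.0/24"): "DE",
}


def lookup_country(ip: str) -> str:
    """Return the country code for an IP, or "Unknown" if no prefix matches."""
    addr = ipaddress.ip_address(ip)
    for network, country in GEO_TABLE.items():
        if addr in network:
            return country
    return "Unknown"


def refine(raw_record: dict) -> dict:
    """Build the long-term record: keep the country, drop the raw IP."""
    refined = {k: v for k, v in raw_record.items() if k != "ip"}
    refined["country"] = lookup_country(raw_record["ip"])
    return refined
```

Calling `refine({"ip": "192.0.2.17", "uri_path": "/wiki/Cat"})` yields a record with `uri_path` and `country` but no `ip` field.
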
> The Privacy Policy is the most prominent policy on the far left on the
> footer of every page served by every editable project, and says explicitly
> that consent is required for the use of geolocation.

The privacy policy talks about client-side geolocation, used to offer you
geo-specific features on the client side, which is an entirely different
topic from what we are talking about here. IP addresses are going to be sent
via HTTP with your request regardless, and the geolocation we do (to be
able to report, for example, pages per country, one of the reports most
sought after by our community) has nothing to do with geolocated features.

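As a rough illustration of the pages-per-country report, here is a sketch that counts records which already carry a country code and no IP. The function name and the `country` field are assumptions made for the example, not the actual report code.

```python
from collections import Counter


def pageviews_per_country(records):
    """Count pageview records by their (already resolved) country code."""
    return Counter(record["country"] for record in records)


# Illustrative input: the records carry a country but no IP address.
sample = [
    {"uri_path": "/wiki/Cat", "country": "US"},
    {"uri_path": "/wiki/Dog", "country": "DE"},
    {"uri_path": "/wiki/Cat", "country": "US"},
]
report = pageviews_per_country(sample)
```

The report needs only the country field, which is why the raw IP does not have to be kept for it.
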
> Do we have any privacy experts on staff who can give these issues a
> thorough analysis in light of all the issues raised in
> https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 ?

Anonymization is hard, but thus far no one has mentioned doing that, right?
When it comes to IP data, again, we do not keep it long term, nor do we
anonymize it with any illusion of privacy; we just discard it as soon as we
can.
You can read about our research regarding anonymization here. The gist of it
is that doing it well is quite hard.
https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/K_Anonym…

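One illustration of why hashing alone gives little privacy (this example is not taken from the linked page): the IPv4 space has only 2^32 addresses, so an unsalted hash can be reversed by enumerating candidates. The sketch below searches a single /24 to stay fast; `hash_ip` and `recover_ip` are hypothetical names.

```python
import hashlib
import ipaddress


def hash_ip(ip: str) -> str:
    """Unsalted SHA-256 of the textual IP: the naive 'secure hash' approach."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def recover_ip(target_hash: str, candidate_network: str):
    """Enumerate every address in the network and return the one that matches."""
    for addr in ipaddress.ip_network(candidate_network):
        if hash_ip(str(addr)) == target_hash:
            return str(addr)
    return None


# Only the hash is "stored", yet the original address is recoverable.
stored = hash_ip("192.0.2.77")
recovered = recover_ip(stored, "192.0.2.0/24")
```

A secret key would make enumeration harder for outsiders, but whoever holds the key can still reverse the mapping, so discarding the IP avoids the problem entirely.
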
> If Ops needs IP addresses, they should be able to use synthetic POST
> requests, as far as I can tell. If they anticipate a need for non-anonymous
> GET requests, then perhaps some kind of a debugging switch which could be
> used on a short term basis where an IP range or mask could be entered to
> allow matching addresses to log non-anonymously before expiring in an hour
> would solve any anticipated need?

You can bring that up with the ops team. I doubt we can operate a website for
hundreds of millions of devices (almost a billion) and troubleshoot
networking issues, DoS attacks and other problems without having access to
raw IPs for a short period of time. Ops work doesn't need access to IP data
long term, just near term.

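A sketch of what "near term only" could look like in code: raw IPs stay on recent records and are stripped once a record passes a retention window. The one-day value and the field names are placeholders for illustration, not the actual retention policy.

```python
from datetime import datetime, timedelta, timezone

# Placeholder window for the example; not the actual retention period.
RAW_IP_RETENTION = timedelta(days=1)


def scrub_expired_ips(records, now):
    """Return records with the "ip" field removed once older than the window."""
    cutoff = now - RAW_IP_RETENTION
    scrubbed = []
    for record in records:
        if record["ts"] < cutoff:
            # Past the window: keep everything except the raw IP.
            record = {k: v for k, v in record.items() if k != "ip"}
        scrubbed.append(record)
    return scrubbed


now = datetime(2016, 11, 11, 12, 0, tzinfo=timezone.utc)
records = [
    {"ts": now - timedelta(hours=2), "ip": "192.0.2.1", "country": "US"},
    {"ts": now - timedelta(days=3), "ip": "192.0.2.2", "country": "DE"},
]
result = scrub_expired_ips(records, now)
```
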
On Fri, Nov 11, 2016 at 7:11 AM, James Salsman <jsalsman(a)gmail.com> wrote:
Pine wrote:
I tend to think that checkusers will need the plain IP addresses....
I am not suggesting removing the IP addresses or proxy information from
POST requests as checkuser requires.
We need to anonymize both IP addresses and proxy information with a secure
hash if we want to keep each GET request's geolocation, to be compliant
with the Privacy Policy. The Privacy Policy is the most prominent policy on
the far left on the footer of every page served by every editable project,
and says explicitly that consent is required for the use of geolocation.
The Privacy and other policies make it clear that POST requests and Visual
Editor submissions aren't going to be anonymized.
However, geolocations for POST edit and visual editor submissions still
require explicit consent which we have no way to obtain at present.
Editors' geolocations as they edit are very useful for research, but by the
same token have the most serious privacy concerns. Obtaining consent to
store geolocation seems like it would interfere with, complicate, and
disrupt editing. If geolocation is stored with anonymized IP addresses for
GETs but not POSTs or Visual Editor submissions, both could easily be
recovered because simultaneously interleaved GET and POST requests for
the same article are unavoidable.
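The linking concern can be sketched as follows, under stated assumptions: GET records carry a hashed IP and a country, POST records carry a raw IP, and both carry an article and a timestamp. The field names and the two-second window are hypothetical.

```python
def link_posts_to_gets(posts, gets, window_seconds=2.0):
    """Pair each POST with the single GET for the same article close in time."""
    linked = []
    for post in posts:
        candidates = [
            g for g in gets
            if g["article"] == post["article"]
            and abs(g["ts"] - post["ts"]) <= window_seconds
        ]
        # Only an unambiguous match links the raw IP to the hash and country.
        if len(candidates) == 1:
            g = candidates[0]
            linked.append(
                {"ip": post["ip"], "ip_hash": g["ip_hash"], "country": g["country"]}
            )
    return linked


gets = [
    {"ip_hash": "a1", "country": "FR", "article": "Cat", "ts": 100.0},
    {"ip_hash": "b2", "country": "JP", "article": "Dog", "ts": 100.5},
]
posts = [{"ip": "192.0.2.7", "article": "Cat", "ts": 101.0}]
linked = link_posts_to_gets(posts, gets)
```
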
Do we have any privacy experts on staff who can give these issues a
thorough analysis in light of all the issues raised in
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 ?
If Ops needs IP addresses, they should be able to use synthetic POST
requests, as far as I can tell. If they anticipate a need for non-anonymous
GET requests, then perhaps some kind of a debugging switch which could be
used on a short term basis where an IP range or mask could be entered to
allow matching addresses to log non-anonymously before expiring in an hour
would solve any anticipated need?
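A minimal sketch of such a debugging switch, assuming a hypothetical `DebugIpSwitch` class: it takes an IP range in CIDR form and stops matching once its time-to-live (one hour by default) has passed. This illustrates the proposal only and is not an existing tool.

```python
import ipaddress
import time


class DebugIpSwitch:
    """Temporarily allow raw-IP logging for addresses in one network."""

    def __init__(self, network: str, ttl_seconds: float = 3600.0, now=None):
        self.network = ipaddress.ip_network(network)
        start = time.time() if now is None else now
        self.expires_at = start + ttl_seconds

    def should_log_raw(self, ip: str, now=None) -> bool:
        """True only while the switch is live and the IP is in the range."""
        current = time.time() if now is None else now
        if current >= self.expires_at:
            return False
        return ipaddress.ip_address(ip) in self.network


# Timestamps are passed explicitly so the behaviour is easy to follow.
switch = DebugIpSwitch("192.0.2.0/24", ttl_seconds=3600, now=1000.0)
in_range = switch.should_log_raw("192.0.2.9", now=1500.0)
out_of_range = switch.should_log_raw("198.51.100.9", now=1500.0)
after_expiry = switch.should_log_raw("192.0.2.9", now=4600.0)
```
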
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics