Re: [Analytics] ensuring reader anonymity

11 Nov 2016


Nuria, regarding the IP addresses specifically (not the proxy information;
for that I'll need more time to go through the use-cases we've had and see
whether we can find work-arounds if we hash it):
Have we considered creating at least two levels of access when it comes to
the IP addresses? From what you describe, it is clear to me
that your team will need to have access to raw IPs for a certain period of
time. It may be the case that no one else uses that information (for all of
the use-cases of the research I've been involved in, hashed IP works as
well, as long as we have geolocation available to us). By creating two
layers of access, we can make sure that your team has access to raw IP
while everyone else doesn't. Is this an option?
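To make "hashed IP plus geolocation" concrete, here is a minimal sketch of what the second access level might see. It is an illustration under stated assumptions, not a description of how the webrequest pipeline actually works: the key, the toy geolocation table, the record fields, and the function names are all hypothetical, and a keyed hash (HMAC-SHA256) is just one possible choice of hashing scheme.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical secret key. The assumption is that it would be stored apart
# from the logs and available only to the team with raw-IP access.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical stand-in for a real geolocation database (documentation ranges).
GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "US",
    ipaddress.ip_network("198.51.100.0/24"): "DE",
}


def geolocate(ip: str) -> str:
    """Return a country code for the IP, or "Unknown" if no range matches."""
    addr = ipaddress.ip_address(ip)
    for network, country in GEO_TABLE.items():
        if addr in network:
            return country
    return "Unknown"


def hash_ip(ip: str) -> str:
    """Return a keyed hash of the IP so equal IPs map to equal tokens."""
    canonical = ipaddress.ip_address(ip).compressed
    return hmac.new(SECRET_KEY, canonical.encode("ascii"), hashlib.sha256).hexdigest()


def sanitize(record: dict) -> dict:
    """Geolocate first, then replace the raw IP with its hash."""
    out = {key: value for key, value in record.items() if key != "ip"}
    out["country"] = geolocate(record["ip"])
    out["ip_hash"] = hash_ip(record["ip"])
    return out


# Made-up sample records: two requests from one address, one from another.
requests = [
    {"ip": "192.0.2.10", "uri": "/wiki/Main_Page"},
    {"ip": "192.0.2.10", "uri": "/wiki/Privacy"},
    {"ip": "198.51.100.7", "uri": "/wiki/Main_Page"},
]

sanitized = [sanitize(r) for r in requests]

# Uniques can still be counted on the hash alone.
unique_clients = len({r["ip_hash"] for r in sanitized})
```

In this sketch the sanitized records keep the country and a stable token per address, which is enough for per-country breakdowns and unique counts, while the raw address stays with whoever holds the original records and the key.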
And one suggestion: if we want to reconsider the way we provide access to
IP addresses, I'd like to suggest that we step back and reconsider the way we
give access to other fields in the webrequest logs as well. This will be a
longer process, but it may be worthwhile. For example, if we decide that
access to raw IP should be limited even further, do we want to have the
same restrictions applied to access to UAs? It's not obvious to me that the
answer should be no.
Best,
Leila
On Fri, Nov 11, 2016 at 8:31 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
...
...
I support any decrease of the storage of plain IP addresses. See also
<https://www.mediawiki.org/wiki/Thread:Talk:Requests_for_comment/Structured_logging/IP_address_and_other_personal_identifying_information>
for more references.
To be clear: on our end we need buffer time so that, should there be a
bug, we can reprocess pageviews if needed (this does happen). That buffer
time is now 60 days. Perhaps it could be a bit shorter, but the raw data
will still need to be available for a matter of weeks, not days. As
mentioned earlier in the thread, we need raw IPs to geolocate requests;
once that is done, the IPs are discarded.
On Fri, Nov 11, 2016 at 12:00 AM, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
...
Dan Andreescu, 10/11/2016 16:00:
...
I don't have as clear a reason for why we store the plain IP in
webrequest.  I think we could count uniques and all that other stuff
with the IP hash.  It's a good question, tentative +1 unless I'm
forgetting something.
I support any decrease of the storage of plain IP addresses. See also
<https://www.mediawiki.org/wiki/Thread:Talk:Requests_for_comment/Structured_logging/IP_address_and_other_personal_identifying_information>
for more references.
Nemo

Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
