Re: [Analytics] use cases for the raw IP data stored on the webrequest data

30 Jan 2017

Dan,

I missed this reply in November to which you referred:

...
 Do the advantages of keeping unanonymized IP reader logs for potential
 debugging needs outweigh the privacy disadvantages?
 Judging from prior postings to this list, the community members' interest
 in correctness of pageview data, pageview tools and pageview API far
 outweighs the concerns with a 60 day retention of raw IPs.
Is that the official position of the Foundation? It has been
explicitly contradicted by the Executive Director, and is not
considered an acceptable practice by the EFF:

https://www.eff.org/pages/eff-ad-wired

or the American Library Association:

http://www.ala.org/advocacy/intfreedom/librarybill/interpretations/privacy

http://www.ala.org/advocacy/library-privacy-guidelines-data-exchange-betwee…

http://www.ala.org/advocacy/privacyconfidentiality

or this law review article:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006

or news media exposés such as:

https://www.washingtonpost.com/news/the-switch/wp/2016/10/11/facebook-twitt…

...
 3. Can Ops use access logs in which the article names have never been
 stored on permanent, non-RAM media?

 4. Can the users who require logs of article names use those in which
 the IP address, proxy information, and geolocation has never been
 stored on permanent media? 
 The implicit assumption here is that reasonable means are not being taken
 to safeguard user data by Technical Operations 
Such measures do not address the subpoena-related concerns of the EFF,
the ALA, the law review article, or the news media exposé.
Furthermore, page 20 of the following dissertation shows that reader
data leaves the custody of Technical Operations:

http://infolab.stanford.edu/~west1/pubs/West_Dissertation-2016.pdf

That says, "We have access to Wikimedia’s full server logs, containing
all HTTP requests to Wikimedia projects." Page 19 indicates that this
information includes the "IP address, proxy information, and user agent."
See also:

https://youtu.be/jQ0NPhT-fsE&t=25m40s

...
 You have also made other technical assumptions, such as that one can
 only use volatile storage to safely store data.
On the contrary, the assumption is that it's safer to not store PII on
nonvolatile storage if it can be associated with the names of articles
being read.

If a GET web request comes in from a reader, and the article name is
stored in one disk file with the time accurate to the hour, and the IP
and proxy information with an exact timestamp is stored in another
file, would that meet all of the Foundation's and research community's
needs?
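
A minimal sketch of the split described above, in Python. The function
names, the tab-separated row layout, and the example values are
illustrative assumptions only; nothing here reflects how the webrequest
pipeline is actually implemented. In-memory buffers stand in for the two
disk files.

```python
import io
from datetime import datetime, timezone


def split_request(article, ip, proxy, when):
    """Return (pageview_row, network_row) for a single GET request.

    pageview_row carries the article name with the time truncated to
    the hour; network_row carries the IP and proxy information with
    the exact timestamp. Neither row contains the other's fields.
    """
    hour = when.replace(minute=0, second=0, microsecond=0)
    pageview_row = f"{hour.isoformat()}\t{article}\n"
    network_row = f"{when.isoformat()}\t{ip}\t{proxy}\n"
    return pageview_row, network_row


def log_request(pageview_log, network_log, article, ip, proxy, when):
    """Append the two rows to two separate file-like objects."""
    pageview_row, network_row = split_request(article, ip, proxy, when)
    pageview_log.write(pageview_row)
    network_log.write(network_row)


# Hypothetical request; 192.0.2.1 is a documentation-only address.
pageview_log = io.StringIO()
network_log = io.StringIO()
when = datetime(2017, 1, 30, 14, 37, 12, tzinfo=timezone.utc)
log_request(pageview_log, network_log, "Example_article", "192.0.2.1", "-", when)
```

The sketch only shows the separation of fields. It does not address
whether the two files could be re-joined by row order or by traffic
volume within a quiet hour, which would need separate consideration.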
