Re: [Analytics] use cases for the raw IP data stored on the webrequest data

James Salsman

29 Jan 2017

7:45 a.m.

I've added the following unanswered questions at

https://wikitech.wikimedia.org/wiki/Talk:Analytics/Data/Webrequest/RawIPUsag...

1. Is the ability to rerun metrics more important than protecting reader privacy?

2. On what basis is the decision on the previous question made, or if there is no decision on the question yet, who has the authority to establish that basis?

3. Can Ops use access logs in which the article names have never been stored on permanent, non-RAM media?

4. Can the users who require logs of article names use those in which the IP address, proxy information, and geolocation have never been stored on permanent media?

Dan Garry

29 Jan

11:54 p.m.

New subject: use cases for the raw IP data stored on the webrequest data

James,

Given that this post continues a pattern I have observed in you of asking loaded questions, repeating the same questions over and over, and assuming bad faith, I would not be surprised if people do not find engaging with you a fruitful use of their time. You may wish to consider this for the future.

On 28 January 2017 at 19:45, James Salsman jsalsman@gmail.com wrote:

...

> Is the ability to rerun metrics more important than protecting reader privacy?

This is a classic example of a loaded question (https://en.wikipedia.org/wiki/Loaded_question). You also asked almost this exact same question two months ago (https://lists.wikimedia.org/pipermail/analytics/2016-November/005519.html) and received several answers. What is the reason for asking almost exactly the same question again?

...

> On what basis is the decision on the previous question made, or if there is no decision on the question yet, who has the authority to establish that basis?

On what basis was the decision made to have logs? Nuria already answered this question (https://lists.wikimedia.org/pipermail/analytics/2016-November/005524.html). They are essential for debugging problems and understanding our readership so we can serve their needs better. To quote Nuria, who was even repeating herself when she said this, "Again, repeating myself: could we make this 60 days interval slightly smaller? Yes, probably, a bit. Could we do without short term retention of raw IPs? No, not really." Again, I am unsure what you hope to learn by repeating this question.

...

> Can Ops use access logs in which the article names have never been stored on permanent, non-RAM media?

> Can the users who require logs of article names use those in which the IP address, proxy information, and geolocation have never been stored on permanent media?

The implicit assumption here is that reasonable means are not being taken to safeguard user data by Technical Operations, which is not a good way to start a conversation. You have also made other technical assumptions, such as that one can only use volatile storage to safely store data. I suggest you trust the expertise of Technical Operations, since these questions appear to show you do not possess such expertise yourself.

Dan

James Salsman

30 Jan

6:31 a.m.

New subject: use cases for the raw IP data stored on the webrequest data

Dan,

I missed this reply in November to which you referred:

...

...
>> Do the advantages of keeping unanonymized IP reader logs for potential debugging needs outweigh the privacy disadvantages?

> Judging from prior postings to this list, the community members' interest in correctness of pageview data, pageview tools and pageview API far outweighs the concerns with a 60 day retention of raw IPs.

Is that the official position of the Foundation? It has been explicitly contradicted by the Executive Director, and is not considered an acceptable practice by the EFF:

https://www.eff.org/pages/eff-ad-wired

or the American Library Association:

http://www.ala.org/advocacy/intfreedom/librarybill/interpretations/privacy

http://www.ala.org/advocacy/library-privacy-guidelines-data-exchange-between...

http://www.ala.org/advocacy/privacyconfidentiality

or this law review article:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006

or news media exposé articles such as:

https://www.washingtonpost.com/news/the-switch/wp/2016/10/11/facebook-twitte...

...

...

> Can Ops use access logs in which the article names have never been stored on permanent, non-RAM media?

> Can the users who require logs of article names use those in which the IP address, proxy information, and geolocation have never been stored on permanent media?

> The implicit assumption here is that reasonable means are not being taken to safeguard user data by Technical Operations

Such measures do not address the subpoena-related concerns of the EFF, the ALA, the law review article, or the news media exposé. Furthermore, page 20 of the following document shows that reader data leaves the custody of Technical Operations:

http://infolab.stanford.edu/~west1/pubs/West_Dissertation-2016.pdf

That says, "We have access to Wikimedia’s full server logs, containing all HTTP requests to Wikimedia projects." Page 19 indicates that this information includes the "IP address, proxy information, and user agent." See also:

https://youtu.be/jQ0NPhT-fsE&t=25m40s

...

> You have also made other technical assumptions, such as that one can only use volatile storage to safely store data.

On the contrary, the assumption is that it's safer to not store PII on nonvolatile storage if it can be associated with the names of articles being read.

If a GET web request comes in from a reader, and the article name is stored in one disk file with the time accurate to the hour, and the IP and proxy information with an exact timestamp is stored in another file, would that meet all of the Foundation's and research community's needs?
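
To make the question above concrete, here is a minimal sketch of the split it describes, assuming a request carries a timestamp, an IP address, proxy information, and an article name. The names `WebRequest` and `split_request` and the record fields are hypothetical illustrations, not part of the webrequest pipeline or any Wikimedia system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class WebRequest:
    """Hypothetical in-memory view of one GET request (not a real schema)."""
    timestamp: datetime
    ip: str
    proxy: str
    article: str


def split_request(req: WebRequest) -> tuple[dict, dict]:
    """Return (article_record, network_record) destined for two separate files.

    The article record keeps the time only to the hour and carries no
    IP or proxy data; the network record keeps the exact timestamp and
    carries no article name.
    """
    # Truncate the timestamp to the start of its hour.
    hour = req.timestamp.replace(minute=0, second=0, microsecond=0)
    article_record = {"hour": hour.isoformat(), "article": req.article}
    network_record = {
        "timestamp": req.timestamp.isoformat(),
        "ip": req.ip,
        "proxy": req.proxy,
    }
    return article_record, network_record


if __name__ == "__main__":
    example = WebRequest(
        timestamp=datetime(2017, 1, 30, 6, 31, 12, tzinfo=timezone.utc),
        ip="192.0.2.1",  # address from the documentation-only range
        proxy="",
        article="Example",
    )
    for record in split_request(example):
        print(record)
```

The sketch only shows how the two records would be separated before being written out. It does not settle whether the two files could be correlated afterward, or whether either file alone would meet the debugging and research needs discussed in this thread.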

analytics@lists.wikimedia.org