Leila Zia wrote:
... we are not aware of any reader logs being shipped out of the WMF servers.
Page 20 of http://infolab.stanford.edu/~west1/pubs/West_Dissertation-2016.pdf says, "We have access to Wikimedia’s full server logs, containing all HTTP requests to Wikimedia projects." Page 19 indicates that this information includes the "IP address, proxy information, and user agent."
At https://youtu.be/jQ0NPhT-fsE&t=25m40s Dr. West says, "we have the complete ... server logs from Wikipedia ... about 14 terabytes of raw logs per month."
If this does not imply that the logs are copied from Foundation servers, that is certainly advantageous over the apparent meaning of the language used. But I question whether recording the personally identifying data in the first place is wise.
I understand that there are currently two other university research laboratories which have similar access. Is that correct?
Would anyone in the Foundation have any way to know whether any of the researchers with access are subject to National Security Letters, a subpoena from a US or foreign law enforcement agency, or blackmail, extortion, or bribery, for that matter?
Is creating the MD5 hash described on page 19 of Dr. West's dissertation, after filtering bots from the user agents and discarding the IP address before ever storing the log files to disk, an appropriate solution to this problem?
Should SHA-512 be used instead of MD5?
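To make the two questions above concrete, here is a minimal sketch of hashing before storage. It is not the dissertation's or the Foundation's actual pipeline. It assumes the hash is taken over the IP address joined with the user agent, and the record fields, the `pseudonymize` function, and the bot pattern are all hypothetical names chosen for illustration.

```python
import hashlib
import re

# Hypothetical bot pattern for illustration only; real bot filtering
# would rely on a maintained list of known crawler user agents.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def pseudonymize(record, algorithm="sha512"):
    """Return a storable copy of `record` without the IP or user agent.

    `record` is assumed to be a dict with "ip", "user_agent" and "path"
    keys (an assumed layout, not the real webrequest schema).
    Returns None for requests whose user agent looks like a bot.
    """
    user_agent = record["user_agent"]
    if BOT_PATTERN.search(user_agent):
        return None  # filter bots before anything is hashed or stored

    # Hash IP and user agent together; the raw values are not kept.
    material = (record["ip"] + "|" + user_agent).encode("utf-8")
    client_id = hashlib.new(algorithm, material).hexdigest()
    return {"client_id": client_id, "path": record["path"]}


if __name__ == "__main__":
    request = {
        "ip": "192.0.2.10",  # documentation-range address
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "path": "/wiki/Main_Page",
    }
    print(pseudonymize(request, "md5"))
    print(pseudonymize(request, "sha512"))
```

Switching `algorithm` from `"md5"` to `"sha512"` only changes the digest function. In either case the hash here is unkeyed, so anyone who can enumerate candidate IP addresses and user agents could recompute it; a secret key (for example via `hmac`) would be a separate design decision.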
James Salsman wrote:
If this does not imply that the logs are copied from Foundation servers, that is certainly advantageous over the apparent meaning of the language used.
Reading the links you provided, and Robert West's acknowledgements which you did not link to, the above strikes me as being creation of drama as opposed to asking a question assuming good faith. Since Robert West had a Wikimedia Fellowship, I assume that he was able to analyze data from Wikipedia directly and that no transfer of data outside of the WMF has taken place. I'm sure Leila Zia is able to clarify.
Regards, Thyge - Sir48
Hi James,
If this does not imply that the logs are copied from Foundation servers, that is certainly advantageous over the apparent meaning of the language used.
I am saddened to see that – instead of asking (legitimate) questions to clarify how data is collected and shared – you are assuming bad faith, publicly undermining people across multiple teams at Wikimedia – Security, Legal, Analytics and Research – whose job is to protect the data the WMF collects for a variety of research and operational purposes.
Let me briefly restate what Leila said earlier: the terms of our formal collaborations https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations prohibit the sharing of any raw data containing PII (such as webrequest logs) outside of WMF-operated servers, as well as the retention of any such data past our data retention period https://meta.wikimedia.org/wiki/Data_retention_guidelines. If you have any substantiated concerns about the collection, retention, or sharing of data for the purpose of this or other projects, I invite you to follow Leila's advice and file a request.
But I question whether recording the personally identifying data in the first place is wise.
Our privacy policy https://meta.wikimedia.org/wiki/Privacy_policy explains in detail what WMF considers PII and how we collect it. If you have questions about the collection and retention of PII, please post them here https://meta.wikimedia.org/wiki/Talk:Privacy_policy. All data that's collected by the WMF is transparently documented on Wikitech http://wikitech.wikimedia.org/wiki/Analytics.
I understand that there are currently two other university research laboratories which have similar access. Is that correct?
Current formal collaborations under an NDA are documented on this page https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations#Current_collaborations (see also our FAQ on Meta https://meta.wikimedia.org/wiki/Research:FAQ). Specifics of data collection and analysis are described on the corresponding project page.
Dario
On Tue, Nov 8, 2016 at 2:01 AM, Thyge ltl.privat@gmail.com wrote:
James Salsman wrote:
If this does not imply that the logs are copied from Foundation servers, that is certainly advantageous over the apparent meaning of the language used.
Reading the links you provided, and Robert West's acknowledgements which you did not link to, the above strikes me as being creation of drama as opposed to asking a question assuming good faith. Since Robert West had a Wikimedia Fellowship, I assume that he was able to analyze data from Wikipedia directly and that no transfer of data outside of the WMF has taken place. I'm sure Leila Zia is able to clarify.
Regards, Thyge - Sir48