Dario,
I assumed that when an affiliated researcher apart from Foundation staff says, "we have the complete server logs for Wikipedia," amounting to 17 terabytes per month, that means they possess the information. I am glad to be wrong about that, but I object to the implication that such an assumption based on the plain language of the statement could possibly be made in bad faith.
the terms of our formal collaborations https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations prohibit the sharing of any raw data containing PII (such as webrequest logs) outside of WMF operated servers,
There is nothing on that page which suggests that prohibition.
as well as the retention of any such data past our data retention period https://meta.wikimedia.org/wiki/Data_retention_guidelines
That page says, "Information (including personal information) collected through participation in a survey or other research conducted by the Wikimedia Foundation will be retained indefinitely for educational, development, or other related purposes, unless otherwise indicated in the privacy policy or statement of such survey or research."
https://meta.wikimedia.org/w/index.php?title=Talk:2016_Strategy/Draft_WMF_St... says that the Foundation's standard research NDAs include an "obligation to return or destroy any copies of confidential information the individual may have upon request by WMF"
Does that not imply that such copies are allowed in general?
I hope we can move forward to a solution to the general problem.
Is there any legitimate research or any other need to save IP addresses associated with HTTP GET web logs to disk prior to creating a secure hash of them?
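To make the question concrete, here is a minimal sketch of what hashing before writing to disk could look like. It is an illustration only, not a description of how the Foundation's pipeline works: the key, the function names, and the space-separated log format with the client IP in the first field are all assumptions of mine. It uses a keyed hash (HMAC-SHA256) rather than a bare hash, since the IPv4 address space is small enough that unkeyed digests can be reversed by enumeration.

```python
import hashlib
import hmac

# Hypothetical secret key (an assumption for this sketch); in practice it
# would be held in memory only and rotated on a schedule.
SECRET_KEY = b"example-key-rotate-regularly"


def pseudonymize_ip(ip: str, key: bytes = SECRET_KEY) -> str:
    """Return the hex HMAC-SHA256 digest of an IP address string."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


def sanitize_log_line(line: str, key: bytes = SECRET_KEY) -> str:
    """Replace the first space-separated field (assumed to be the client IP)
    with its keyed digest, leaving the rest of the line unchanged."""
    ip, sep, rest = line.partition(" ")
    return pseudonymize_ip(ip, key) + sep + rest


if __name__ == "__main__":
    # Hypothetical log line; 203.0.113.7 is a documentation-range address.
    raw = "203.0.113.7 GET /wiki/Main_Page 200"
    # Only the sanitized line would ever be written to disk.
    print(sanitize_log_line(raw))
```

With this approach, requests from the same address still map to the same token while the key is unchanged, so per-visitor analysis remains possible without the raw address being stored.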
On Tue, Nov 8, 2016 at 9:10 AM, James Salsman jsalsman@gmail.com wrote:
I assumed that when an affiliated researcher apart from Foundation staff says, "we have the complete server logs for Wikipedia," amounting to 17 terabytes per month, that means they possess the information. I am glad to be wrong about that, but I object to the implication that such an assumption based on the plain language of the statement could possibly be made in bad faith.
I am glad we cleared that confusion.
the terms of our formal collaborations https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations prohibit the sharing of any raw data containing PII (such as webrequest logs) outside of WMF operated servers,
There is nothing on that page which suggests that prohibition.
You're correct that that document doesn't describe in detail the data access process. When we start a formal collaboration under an NDA, we have an onboarding process that gives researchers restricted access to our cluster, covers server access responsibilities and best practices around the handling of private data. I'll check with our Legal and Security team if we can better document this process.
as well as the retention of any such data past our data retention period https://meta.wikimedia.org/wiki/Data_retention_guidelines
That page says, "Information (including personal information) collected through participation in a survey or other research conducted by the Wikimedia Foundation will be retained indefinitely for educational, development, or other related purposes, unless otherwise indicated in the privacy policy or statement of such survey or research."
This is for surveys requesting explicit (*opt in*) consent to collect and retain specific types of data (such as demographic information) from participants, not for data collected by default via our webrequest logs. Webrequest logs and instrumentation data are purged/sanitized by default within the 90-day retention window; most often the data sits on our servers for a much shorter time.
https://meta.wikimedia.org/w/index.php?title=Talk:2016_Strategy/Draft_WMF_Strategy&diff=15467086&oldid=15466763 says that the Foundation's standard research NDAs include an "obligation to return or destroy any copies of confidential information the individual may have upon request by WMF"
Does that not imply that such copies are allowed in general?
IANAL, so I can't comment on that, but I believe this is a clause that's part of our NDA to prevent confidential information (not specifically PII) from being retained by third parties past the term of the NDA.
I hope we can move forward to a solution to the general problem.
Is there any legitimate research or any other need to save IP addresses associated with HTTP GET web logs to disk prior to creating a secure hash of them?
These are considerations that the analytics / ops teams are best suited to answer; I encourage you to relay them to analytics-l if you want to have a more technical discussion.
HTH, Dario
If you want to hear about the results of this research collaboration, or have additional questions about the data collection approach or the analysis, I invite you to come and join us at our upcoming showcase on *Wednesday 11/16*.
https://lists.wikimedia.org/pipermail/analytics/2016-November/005504.html
On Tue, Nov 8, 2016 at 10:42 AM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
On Tue, Nov 8, 2016 at 9:10 AM, James Salsman jsalsman@gmail.com wrote:
I assumed that when an affiliated researcher apart from Foundation staff says, "we have the complete server logs for Wikipedia," amounting to 17 terabytes per month, that means they possess the information. I am glad to be wrong about that, but I object to the implication that such an assumption based on the plain language of the statement could possibly be made in bad faith.
I am glad we cleared that confusion.
the terms of our formal collaborations https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations prohibit the sharing of any raw data containing PII (such as webrequest logs) outside of WMF operated servers,
There is nothing on that page which suggests that prohibition.
You're correct that that document doesn't describe in detail the data access process. When we start a formal collaboration under an NDA, we have an onboarding process that gives researchers restricted access to our cluster, covers server access responsibilities and best practices around the handling of private data. I'll check with our Legal and Security team if we can better document this process.
as well as the retention of any such data past our data retention period https://meta.wikimedia.org/wiki/Data_retention_guidelines
That page says, "Information (including personal information) collected through participation in a survey or other research conducted by the Wikimedia Foundation will be retained indefinitely for educational, development, or other related purposes, unless otherwise indicated in the privacy policy or statement of such survey or research."
This is for surveys requesting explicit (*opt in*) consent to collect and retain specific types of data (such as demographic information) from participants, not for data collected by default via our webrequest logs. Webrequest logs and instrumentation data are purged/sanitized by default within the 90-day retention window; most often the data sits on our servers for a much shorter time.
https://meta.wikimedia.org/w/index.php?title=Talk:2016_Strategy/Draft_WMF_Strategy&diff=15467086&oldid=15466763 says that the Foundation's standard research NDAs include an "obligation to return or destroy any copies of confidential information the individual may have upon request by WMF"
Does that not imply that such copies are allowed in general?
IANAL, so I can't comment on that, but I believe this is a clause that's part of our NDA to prevent confidential information (not specifically PII) from being retained by third parties past the term of the NDA.
I hope we can move forward to a solution to the general problem.
Is there any legitimate research or any other need to save IP addresses associated with HTTP GET web logs to disk prior to creating a secure hash of them?
These are considerations that the analytics / ops teams are best suited to answer; I encourage you to relay them to analytics-l if you want to have a more technical discussion.
HTH, Dario
wikimedia-l@lists.wikimedia.org