I understand that there are many other projects in the pipeline. I don't know where this one would fall in the list of priorities; it does make sense to me that the Wikistats transition would be a higher priority. If it turns out that there is a poor cost-to-benefit ratio from diving deeply into this issue, then by all means move on. My biggest concern (which may be different from the concerns of James and others) is not so much about the length of time that the logs are retained (60 or 90 days, and possibly less than that for data that can be hashed) but about who has access to them, especially people who are neither community functionaries nor WMF staff. Like you, I have other pressing concerns besides this set of issues, so I'll leave my thoughts here for now and hope that the experts who work in this area will take them into consideration.

To summarize my thinking: I'd encourage exploratory work on tightening what kind of data is retained and (especially) who has access to it, and if that exploratory work suggests a poor cost-to-benefit ratio for further work at this time, then I'd say moving on to other issues is OK and this can be tabled until there's a good reason to revisit the issue.

Pine


On Fri, Nov 11, 2016 at 11:16 AM, Leila Zia <leila@wikimedia.org> wrote:
Hi Pine,

On Fri, Nov 11, 2016 at 10:39 AM, Pine W <wiki.pine@gmail.com> wrote:
On Fri, Nov 11, 2016 at 9:25 AM, Leila Zia <leila@wikimedia.org> wrote:
Nuria, regarding the IP addresses specifically (not the proxy information, for which I'll need more time to go through the use cases we've had and see whether we can find workarounds if we hash it):

Have we considered creating at least two levels of access when it comes to the IP addresses? From what you describe, it is clear to me that your team will need to have access to raw IPs for a certain period of time. It may be the case that no one else uses that information (for all of the use cases of the research I've been involved in, hashed IPs work just as well, as long as we have geolocation available to us). By creating two layers of access, we can make sure that your team has access to raw IPs while everyone else doesn't. Is this an option?

And one suggestion: if we want to reconsider the way we provide access to IP addresses, I'd like to suggest that we step back and reconsider the way we give access to other fields in the webrequest logs as well. This will be a longer process, but it may be worthwhile. For example, if we decide that access to raw IPs should be limited even further, do we want to have the same restrictions applied to access to user agents (UAs)? It's not obvious to me that the answer should be no.

Best,
Leila


I'd be happy to have Legal and Analytics take a look at what could be done to tighten the screws a bit on who has access to other data in the logs such as UAs. (To follow up on a comment from Wikimedia-l: I'm also very wary of letting people outside of WMF and the community have access to this kind of information, even with a signed NDA.)

I'm not a supporter of the narrative that non-staff folks who have access to the webrequest logs should be limited more than staff members who have access to the logs. Some of these folks are highly trained individuals (sometimes even more so than staff members) and some of them are less experienced but work very closely with a staff member who is experienced in dealing with sensitive data. They understand the importance of the data they work with, and we do our share in onboarding them and making sure we are all on the same page about what data they're working with and how they should handle it.

Let's step back:

* Subpoena-related concerns: the best way to handle this from the data storage perspective is to not have the data at all. That is why very sensitive data in the webrequest logs is currently purged after 60 days. As Nuria said, this retention period may be shortened a little, but, if only because of operational constraints, we won't be able to avoid storing this data altogether.

* Error-related concerns: one way to reduce errors is to constrain the number of people who can access the data (which is already happening; we're talking about increasing restrictions here). In this case, there is very little difference between staff and non-staff folks who have access to webrequest logs at the moment. Mistakes can be made by people in either group. I may make a mistake and publish an output in my GitHub account with the top 10 IP addresses that have accessed WP in the last hour. Anyone accessing this data can make this mistake. The logical thing to do is to reduce the number of people who have access to that data without /having to/ have it. If I don't /need/ to see the IPs for my work, I shouldn't see them, whether I'm a staff member or a non-staff person under NDA. If I do need to see them, then we should accept that mistakes can happen, but we will do our best to reduce them.

* I also want to point out prioritization here, which is something Nuria and her team should handle (and this will affect Security, Research, and Legal as well):
 
The Analytics team has been allocating resources to transition wikistats. This has been a gigantic endeavour by the team. We know that if wikistats data is not generated for a few months, we will have a lot of unhappy people around. If we are to step back and allocate resources to spend hours on rethinking how we handle webrequest logs (and I can assure you that this will require at least 2 full-time work weeks from many people, likely spread over months), we will have to slow down some other effort. Also, consider Security, which needs to be involved in this process. You know better than I do that Security has a lot to do with very few people.

imho, we are doing a very good job with the way we are handling webrequest logs data at the moment given our constraints. Sure, we can and should improve some steps over time.

Best,
Leila

 

Pine


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


