Hi Pine,

On Fri, Nov 11, 2016 at 10:39 AM, Pine W <wiki.pine@gmail.com> wrote:
On Fri, Nov 11, 2016 at 9:25 AM, Leila Zia <leila@wikimedia.org> wrote:
Nuria, regarding the IP addresses specifically (not the proxy information, for which I'll need more time to go through the use cases we've had and see whether we can find workarounds if we hash it):

Have we considered creating at least two levels of access to IP addresses? From what you describe, it is clear to me that your team will need access to raw IPs for a certain period of time. It may be that no one else uses that information (for all of the research use cases I've been involved in, hashed IPs work just as well, as long as we have geolocation available to us). By creating two layers of access, we can make sure that your team has access to raw IPs while everyone else doesn't. Is this an option?

And one suggestion: if we want to reconsider the way we provide access to IP addresses, I'd like to suggest that we step back and reconsider the way we give access to other fields in the webrequest logs as well. This will be a longer process, but it may be worthwhile. For example, if we decide that access to raw IPs should be limited even further, do we want to apply the same restrictions to UAs? It's not obvious to me that the answer should be no.

Best,
Leila


I'd be happy to have Legal and Analytics take a look at what could be done to tighten the screws a bit on who has access to other data in the logs such as UAs. (To follow up on a comment from Wikimedia-l: I'm also very wary of letting people outside of WMF and the community have access to this kind of information, even with a signed NDA.)

I'm not a supporter of the narrative that non-staff folks who have access to the webrequest logs should be restricted more than staff members who have access to the logs. Some of these folks are highly trained individuals (sometimes more so than staff members), and some are less experienced but work very closely with a staff member who is experienced in dealing with sensitive data. They understand the importance of the data they work with, and we do our share in onboarding them and making sure we are all on the same page about what data they're working with and how they should handle it.

Let's step back:

* Subpoena-related concerns: from the data storage perspective, the best way to handle this is to not have the data at all. That is why very sensitive data in the webrequest logs is currently purged after 60 days. As Nuria said, this retention period may be shortened a little, but, if only because of operational constraints, we can't avoid storing this data altogether.

* Error-related concerns: One way to reduce errors is to constrain the number of people who can access the data (which is already happening; we're talking about increasing restrictions here). In this respect, there is very little difference between staff and non-staff folks who have access to webrequest logs at the moment. People in either group can make mistakes. I might, for example, mistakenly publish to my GitHub account an output listing the top 10 IP addresses that accessed WP in the last hour. Anyone accessing this data can make that mistake. The logical thing to do is to limit access to the people who /have to/ have it. If I don't /need/ to see the IPs for my work, I shouldn't see them, whether I'm a staff member or a non-staff person under NDA. If I do need to see them, then we should accept that mistakes can happen, while doing our best to reduce them.

* I also want to point out prioritization here, which is something Nuria and her team should handle (and this will affect Security, Research, and Legal as well):
 
The Analytics team has been allocating resources to transition wikistats. This has been a gigantic endeavour by the team. We know that if wikistats data is not generated for a few months, we will have a lot of unhappy people around. If we are to step back and spend hours rethinking how we handle webrequest logs (and I can assure you that this will require at least two full-time work weeks from many people, likely spread over months), we will have to slow down some other effort. Also, consider Security, which needs to be involved in this process. You know better than I that Security has a lot to do with very few people.

IMHO, given our constraints, we are doing a very good job with the way we handle webrequest log data at the moment. Sure, we can and should improve some steps over time.

Best,
Leila

 

Pine


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics