Are there any reasons to not replace HTTP GET request IP addresses and proxy information with their SHA-512 secure hash prior to writing them to permanent media?
So, just to give context, our HTTP requests take this path:
- varnish log (very small buffer, not permanent)
- varnishkafka
- kafka (small buffer, I think 7 days)
- camus
- refine process (we use IPs at this point to geolocate)
- webrequest table on hdfs (this is the first time they're stored on permanent media, for 60 days)
- other datasets like hourly pageview aggregates (IPs are not passed on to these)
So if we wanted to avoid storing them even in the Kafka buffers, we'd have to give up geolocating. I think a lot of people find geolocation very useful (fundraising, research, ops, reading), so it's unlikely to be removed.
I don't have as clear a reason for why we store the plain IP in webrequest. I think we could count uniques and all that other stuff with the IP hash. It's a good question, and a tentative +1 from me unless I'm forgetting something. Even so, it's not so bad: the plain IP is only stored for 60 days, and we have no plain IPs anywhere else (we removed them from EventLogging, for example).
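To make the idea concrete, here is a minimal sketch (Python, not the actual refine job) of what "geolocate first, then hash before writing" could look like. The geolocate() helper and the field names are hypothetical placeholders, not the real schema:

import hashlib

def sha512_hex(value: str) -> str:
    """SHA-512 hex digest of a request field (IP or X-Forwarded-For string)."""
    return hashlib.sha512(value.encode("utf-8")).hexdigest()

def refine_record(record: dict, geolocate) -> dict:
    """Geolocate one raw webrequest record, then replace the raw identifiers
    with their hashes before the record is written to permanent storage."""
    refined = dict(record)
    refined["geocoded_data"] = geolocate(record["ip"])   # needs the raw IP
    refined["ip"] = sha512_hex(record["ip"])              # stored only as a hash
    refined["x_forwarded_for"] = sha512_hex(record.get("x_forwarded_for", ""))
    return refined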
To expand a bit on Dan's answer. For analytics we need raw IPs to do geolocation, which is an important bit of information, but other than that we really do not need raw IPs for anything else thus far. It is not unheard of for us to have to redo our pageview processing due to bugs in code or issues within the pipeline, so we need to have raw data available for a certain buffer time.
Now, data needed for ops is a different matter: having raw IPs is useful to troubleshoot issues that have to do with connection problems, DOS attacks and others. Normally the troubleshooting work ops does on incoming traffic needs IPs to be available for some weeks, but not months.
Data retention guidelines are documented here: https://meta.wikimedia.org/wiki/Data_retention_guidelines
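For illustration only, the retention rule amounts to something like the sketch below (assuming daily partitions; this is not the actual production purge job, and the names are made up):

from datetime import date, timedelta

RETENTION_DAYS = 60  # per the retention window discussed in this thread

def partitions_to_drop(partition_dates, today=None):
    """Given the dates of existing daily partitions, return those that have
    aged out of the retention window and should be deleted."""
    today = today or date.today()
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [d for d in partition_dates if d < cutoff]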
Dan Andreescu, 10/11/2016 16:00:
I don't have as clear a reason for why we store the plain IP in webrequest. I think we could count uniques and all that other stuff with the IP hash. It's a good question, tentative +1 unless I'm forgetting something.
I support any decrease of the storage of plain IP addresses. See also https://www.mediawiki.org/wiki/Thread:Talk:Requests_for_comment/Structured_logging/IP_address_and_other_personal_identifying_information for more references.
Nemo
I tend to think that checkusers will need the plain IP addresses. Most other uses might be able to work with a hash or some other kind of abstraction.
Pine
I support any decrease of the storage of plain IP addresses. See also https://www.mediawiki.org/wiki/Thread:Talk:Requests_for_comment/Structured_logging/IP_address_and_other_personal_identifying_information for more references.
To be clear: on our end we need buffer time that allows us to know that, should there be a bug, we can reprocess pageviews if needed (this does happen). That buffer time is now 60 days; perhaps it could be a bit smaller, but it is still going to be a matter of weeks, not days, for which the raw data needs to be available. As mentioned earlier in the thread, we need raw IPs to geolocate requests; once that is done, IPs are discarded.
Nuria, regarding the IP addresses specifically (not the proxy information, for which I'll need more time to go through the use-cases we've had and see if we can find work-arounds if we hash it):
Have we considered in the past creating at least two levels of access when it comes to IP addresses? From what you describe, it is clear to me that your team will need access to raw IPs for a certain period of time. It may be the case that no one else uses that information (for all of the research use-cases I've been involved in, a hashed IP works just as well, as long as we have geolocation available to us). By creating two layers of access, we can make sure that your team has access to raw IPs while everyone else doesn't. Is this an option?
And one suggestion: if we want to reconsider the way we provide access to IP addresses, I'd suggest that we step back and reconsider the way we give access to other fields in the webrequest logs as well. This will be a longer process, but it may be worthwhile. For example, if we decide that access to raw IPs should be limited even further, do we want the same restrictions applied to access to UAs? It's not obvious to me that the answer should be no.
Best, Leila
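One way to picture the two-level idea (purely a sketch, not the actual setup; table and column names are illustrative): keep the raw table restricted to the small group that needs it, and publish a derived table in which the IP has been replaced by its SHA-512 hash for everyone else.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.appName("webrequest_sanitized").getOrCreate()

raw = spark.table("wmf.webrequest")  # access restricted to the narrow group

# Derived copy for broader access: raw IP replaced with its SHA-512 digest.
sanitized = raw.withColumn("ip_hash", sha2(col("ip"), 512)).drop("ip")
sanitized.write.mode("overwrite").saveAsTable("wmf.webrequest_sanitized")

The second access level would still be enforced by filesystem/Hive permissions on the two tables, not by the hashing itself.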
On Fri, Nov 11, 2016 at 7:36 AM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Hi Pine,
I thought that was specified in either the Privacy Policy or Terms of Use but I can't find the specific reference, and that bothers me.
This is specified in the data retention guidelines: https://meta.wikimedia.org/wiki/Data_retention_guidelines
Cheers!
Thanks. Why is that info specified in the Data retention guidelines rather than in the Terms of Use or Privacy Policy? I worry that the retention guidelines require a lower threshold of notice for changes than the ToU or PP, and may not carry the same degree of legal assurance that WMF will abide by them. Could the Data retention guidelines be fully incorporated into the PP and/or ToU?
I'd be happy to have Legal and Analytics take a look at what could be done to tighten the screws a bit on who has access to other data in the logs such as UAs. (To follow up on a comment from Wikimedia-l: I'm also very wary of letting people outside of WMF and the community have access to this kind of information, even with a signed NDA.)
Pine
Hi Pine,
I'm not a supporter of the narrative that non-staff folks who have access to the webrequest logs should be limited more than staff members who have access to the logs. Some of these folks are highly trained individuals (sometimes even more so than staff members), and some of them are less experienced but work very closely with a staff member who is experienced in dealing with sensitive data. They understand the importance of the data they work with, and we do our share in onboarding them and making sure we are all on the same page about what data they're working with and how they should handle it.
Let's step back:
* Subpoena-related concerns: the best way to handle this from the data storage perspective is to not have the data at all. That is why very sensitive data is currently purged from the webrequest logs after 60 days. As Nuria said, this window may be shortened a little, but because of operational constraints, among other things, we won't be able to avoid storing this data altogether.
* Error-related concerns: one way to reduce errors is to constrain the number of people who can access the data (which is already happening; we're talking here about increasing restrictions). In this case, there is very little difference between the staff and non-staff folks who have access to webrequest logs at the moment. Mistakes can happen to people in either group. I may make a mistake and publish output in my GitHub account with the top 10 IP addresses that have accessed WP in the last hour. That mistake can be made by anyone accessing this data. The logical thing to do is to reduce the number of people who don't /have to/ have access to that data. If I don't /need/ to see the IPs for my work, I shouldn't see them, whether I'm a staff member or non-staff under NDA. If I do need to, then we should accept that mistakes can happen, but we will do our best to reduce them.
* I also want to point out prioritization here, which is something Nuria and her team should handle (and this will affect Security, Research, and Legal as well): the Analytics team has been allocating resources to the wikistats transition. This has been a gigantic endeavour by the team. We know that if wikistats data is not generated for a few months, we will have a lot of unhappy people around. If we are to step back and allocate resources to rethinking how we handle webrequest logs (and I can assure you that this will require at least two full-time weeks of work from many people, likely spread over months), we will have to slow down some other effort. Also consider Security, who need to be involved in this process; you know better than I that Security has a lot to do with very few people.
imho, we are doing a very good job with the way we handle webrequest log data at the moment, given our constraints. Sure, we can and should improve some steps over time.
Best, Leila
And one last email from me until Monday ;)
We are now tracking hashing of IPs in webrequest logs at https://phabricator.wikimedia.org/T150545 (thanks to Nuria). I have asked Nuria to give me two weeks to reach out to the people who work with this data and see whether anyone raises a flag about hashing IP addresses in webrequest logs.
Best, Leila
I understand that there are many other projects in the pipeline. I don't know where this one would fall in the list of priorities; it does make sense to me that the Wikistats transition would be a higher priority. If it turns out that there is a poor cost-to-benefit ratio from diving deeply into this issue, then by all means move on. My biggest concern (which may be different from the concerns of James and others) is not so much the length of time that the logs are retained (60 or 90 days, and possibly less for data that can be hashed) but who has access to them, especially people who are neither community functionaries nor WMF staff. Like you, I have other pressing concerns besides this set of issues, so I'll leave my thoughts here for now and hope that the experts who work in this area will take them into consideration.
To summarize my thinking: I'd encourage exploratory work on tightening what kind of data is retained and (especially) who has access to it, and if that exploratory work suggests a poor cost-to-benefit ratio for further work at this time, then I'd say moving on to other issues is OK and this can be tabled until there's a good reason to revisit the issue.
Pine
On Fri, Nov 11, 2016 at 2:16 PM, Leila Zia leila@wikimedia.org wrote:
- Subpoena related concerns: the best way to handle this from the data storage perspective is to not have the data at all. That is why very sensitive data is purged after 60 days at the moment in webrequest logs. As Nuria said, this length of time may be shortened by a little, but at least because of operational constraints, we won't be able to not store this data at all.
It is worth considering this in context of https://twitter.com/Pinboard/status/797167026481442816
That is, not storing the data is nice, but do we have any plans in place in case a government decides to place a recording device in our data center beside our servers? We may have the best of intentions, but "we don't store it" could in fact be misleading comfort if there is a third party who *is* storing it.
This is perhaps a broader question (and more in line with James' initial inquiry?), as it suggests that we reconsider what sort of protections we can actually provide to our editors, and make sure they know if we can't protect them from state-level monitoring. --scott
Realistically, if a government in a country that hosts one of the WMF data centers decides that they want unfiltered access to the data, I'm not sure how much WMF could do about it. I won't speculate on what kind of defenses WMF might have against that scenario, but I would encourage Analytics, Legal, and Security to have that conversation if they have not already done so. (The US government is not the only government that might engage in this kind of mass surveillance, and such a government may or may not use legal means to accomplish their objectives; other options include various kinds of phishing and social engineering attacks.)
Returning to previous discussions about limiting the number of people who have access to raw IPs and related data, I'm thinking that I like the idea of hashing the data and/or geolocating the data and then giving that processed data to researchers, rather than letting researchers have the raw data. I would be more comfortable with people who are not WMF employees and not community checkusers having access to the processed data than to true IP addresses, UAs, and other similar kinds of data.
Pine
I'm not a supporter of the narrative that non-staff folks who have access to the webrequest logs should be limited more than staff members who have access to the logs. Some of these folks are highly trained individuals (sometimes even more so than staff members) and some of them are less experienced but work very closely with a staff member who is experienced in dealing with sensitive data. They understand the importance of the data they work with, and we do our share in onboarding them and making sure we are all on the same page about what data they're working with and how they should handle it.
+1, researchers collaborating with the Research team on data-centered projects are probably much more aware of data and privacy issues than the average WMF person who accesses data only sporadically.
They might be more aware of the issues, but whether they are compliant with them is a different matter. Rather than take the risk, I would prefer that they have access to the semi-anonymized data. The more people have access to data (both unfiltered and semi-anonymized), the more opportunities there are for accidental or intentional wrongful disclosure, unauthorized hoarding, etc. It seems to me that the mitigation strategies we've discussed in this thread look like they might be good from a privacy standpoint; hopefully the cost to implement them is reasonable.
Looping back to something that I think Leila said earlier, I also like the idea of limiting internal WMF people's access to the unfiltered data to a need-to-know basis, which might mean that people inside WMF would also use semi-anonymized data, not so much because we don't trust them but because this is a mitigation against accidents.
Keeping in mind that other projects also need the limited human resources and financial capacity of the WMF teams that these changes would involve, I would not characterize these changes as projects that need to get underway this quarter, but if they could be done this fiscal year I think that would be good.
Pine
By the way, to the best of my knowledge, all recordings to "permanent media" are overwritten or destroyed after 60 days. I thought that was specified in either the Privacy Policy or Terms of Use but I can't find the specific reference, and that bothers me. Can someone at Legal explain why this isn't specified in either the PP or ToU? (Feel free to fork this question if it becomes a distraction to the original thread.)
Thanks,
Pine
Hi Pine,
Are the data retention guidelines https://meta.wikimedia.org/wiki/Data_retention_guidelines what you're looking for? They're linked in the footer of the privacy policy page, but were done as a separate policy after the privacy policy was already in place.
Best, Jacob
Legal Counsel Wikimedia Foundation
Pine wrote:
I tend to think that checkusers will need the plain IP addresses....
I am not suggesting removing the IP addresses or proxy information from POST requests as checkuser requires.
We need to anonymize both IP addresses and proxy information with a secure hash if we want to keep each GET request's geolocation and be compliant with the Privacy Policy. The Privacy Policy is the most prominent policy, at the far left of the footer of every page served by every editable project, and it says explicitly that consent is required for the use of geolocation. The Privacy and other policies make it clear that POST requests and Visual Editor submissions aren't going to be anonymized.
However, geolocations for POST edit and Visual Editor submissions still require explicit consent, which we have no way to obtain at present. Editors' geolocations as they edit are very useful for research, but by the same token have the most serious privacy concerns. Obtaining consent to store geolocation seems like it would interfere with, complicate, and disrupt editing. And if geolocation is stored with anonymized IP addresses for GETs but not for POSTs or Visual Editor submissions, both could easily be recovered, because simultaneously interleaved GET and POST requests for the same article are unavoidable.
Do we have any privacy experts on staff who can give these issues a thorough analysis in light of all the issues raised in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 ?
If Ops needs IP addresses, they should be able to use synthetic POST requests, as far as I can tell. If they anticipate a need for non-anonymous GET requests, then perhaps some kind of debugging switch, usable on a short-term basis, where an IP range or mask could be entered to allow matching addresses to be logged non-anonymously before expiring in an hour, would cover any anticipated need?
Nuria Ruiz wrote:
.... on our end we need buffer time that allows us to know that should there be a bug we can reprocess pageviews if needed (this does happen). That buffer time is now 60 days and perhaps it could be a bit smaller but it is still going to be a matter of weeks, not days for which the raw data needs to be available.
Do the advantages of keeping unanonymized IP reader logs for potential debugging needs outweigh the privacy disadvantages?
What are the outcomes impacting users of the hypothetical loss of pageviews data compared to a PII leak?
Nuria Ruiz wrote:
.... You can bring that up with ops team, I doubt we can operate a website for hundreds of millions of devices (almost a billion) and troubleshoot networking issues, DOS and others without having access to raw IPs for a short period of time. Ops work doesn't need to have access to IP data long term, just near term.
First, I don't know who or where to ask such questions of the Ops team.
Second, is the following a viable solution to this contingency: discard before storing as the default behavior, with a manual switch that Ops can turn on to store temporary raw logs with IPs for debugging if and when needed, for a limited time -- say an hour or two -- with automatic zeroing deletion at the end of that time period?
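As a sketch of what such a switch might look like (hypothetical, not an existing ops feature): raw IPs are written only for an explicitly enabled address range while the window is open; everything else, and everything after expiry, is hashed.

import hashlib
import ipaddress
import time

class DebugWindow:
    """Short-lived debug switch: log raw IPs only for a given CIDR range, and
    only until the window expires; otherwise log the SHA-512 hash."""

    def __init__(self, cidr: str, ttl_seconds: int = 3600):
        self.network = ipaddress.ip_network(cidr)   # e.g. "198.51.100.0/24"
        self.expires_at = time.time() + ttl_seconds

    def active(self) -> bool:
        return time.time() < self.expires_at

    def log_value(self, ip: str) -> str:
        if self.active() and ipaddress.ip_address(ip) in self.network:
            return ip
        return hashlib.sha512(ip.encode("utf-8")).hexdigest()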
First, I don't know who or where to ask such questions of the Ops team.
You can try the ops channel on IRC: #wikimedia-ops
Second, is the following a viable solution to this contingency: discard before storing as the default behavior, with a manual switch that Ops can turn on to store temporary raw logs with IPs for debugging if and when needed, for a limited time -- say an hour or two -- with automatic zeroing deletion at the end of that time period?
Given that ops issues are likely to last more than an hour or two... ahem, I am going to say that no, this is probably not viable. If logging were to be turned on for days, well, then, maybe. I'm not sure, though, and even in that case the solution is not that different from what is already happening, as IPs, again, are not kept long term.
Do the advantages of keeping unanonymized IP reader logs for potential debugging needs outweigh the privacy disadvantages?
Judging from prior postings to this list, community members' interest in the correctness of pageview data, pageview tools and the pageview API far outweighs the concerns with a 60-day retention of raw IPs.
Again, repeating myself: could we make this 60-day interval slightly smaller? Yes, probably, a bit. Could we do without short-term retention of raw IPs? No, not really.
We need to anonymize both IP addresses and proxy information with a secure hash if we want to keep each GET request's geolocation, to be compliant with the Privacy Policy.
Maybe this is not clear: raw IPs are not kept once geolocation is done. IPs are discarded, and the geolocation info is what is kept long term.
The Privacy Policy is the most prominent policy on the far left of the footer of every page served by every editable project, and says explicitly that consent is required for the use of geolocation.
The privacy policy talks about client-side geolocation used to offer you geo-specific features on the client side, which is an entirely different topic from what we are talking about here. IP addresses are going to be sent via HTTP with your request regardless, and the geolocation we do (to be able to report, for example, pages per country, one of the reports most sought after by our community) has nothing to do with geolocated features.
Do we have any privacy experts on staff who can give these issues a thorough analysis in light of all the issues raised in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006 ?
Anonymization is hard, but thus far no one has mentioned doing that, right? When it comes to IP data, again, we do not keep it long term, nor do we anonymize it with any illusion of privacy; we just discard it as soon as we can. You can read about our research regarding anonymization here (the gist of it is that doing it well is quite hard): https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/K_Anonymi...
If Ops needs IP addresses, they should be able to use synthetic POST requests, as far as I can tell. If they anticipate a need for non-anonymous GET requests, then perhaps some kind of debugging switch, usable on a short-term basis, where an IP range or mask could be entered to allow matching addresses to be logged non-anonymously before expiring in an hour, would cover any anticipated need?
You can bring that up with the ops team. I doubt we can operate a website for hundreds of millions of devices (almost a billion) and troubleshoot networking issues, DOS and others without having access to raw IPs for a short period of time. Ops work doesn't need access to IP data long term, just near term.