Records of article access

Pine W

17 Oct 2014

6:50 a.m.

Hi again Analytics,

I was under the impression that no records are kept of which IPs access which articles on Wikipedia when no edits are made, but it appears that such records are in fact kept [1].

Is this proper? This practice appears to be permissible under the Privacy Policy which states that "We use IP addresses for research and analytics; to better personalize content, notices, and settings for you; to fight spam, identity theft, malware, and other kinds of abuse; and to provide better mobile and other applications."

It is possible that this information is relevant for determining the number of unique visitors that Wikipedia gets and that this information is always properly filtered before it gets to the Signpost. However, given recent discussions which I thought said that Wikipedia was not instrumented to track unique visitors, I am surprised to learn that this already seems to be happening and that the situation has been this way for some time, so I would appreciate clarification.

I want to emphasize that this question is about clarifying the practice of tracking likely unique visitors by IP. This question is not intended to start flame wars, get people into trouble, or limit the Signpost's access to properly filtered information if there has been a determination that WMF's retention of the raw data is appropriate. There might be appropriate secondary questions about making sure that access to the raw IP access data is carefully contained and secured.

Thank you very much,

Pine

[1] https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&dif...

Toby Negrin

17 Oct

7:31 a.m.

Hi Pine --

Thanks for this -- it's a challenging topic but one that the Analytics team takes very seriously.

I'm not familiar with the IP address review that's referenced in the link. I don't know who the staffer might be. We don't currently calculate unique visitors to anything in Analytics and IP address is not a particularly accurate way to assess unique visitors regardless (due to proxies/NATs/etc).

We do store IPs as part of page requests in our raw logs which are deleted every 30 days. This data is kept on a system where access is limited and controlled by the operations team. We're in line with the privacy policy on this.

To be clear, we are currently considering mechanisms to count unique "requests" -- we rely on Comscore for this data and, for several reasons primarily related to mobile usage, it's not sufficient to understand our usage patterns. We are putting together some proposals to do this in as limited a way as possible and in a way that's respectful to our users. We'll share this with the community when we feel we understand the use cases and trade-offs well enough to discuss them in an informed manner.

-Toby

We do store the IP address associated with varnish requests as part of the log. This data is

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Pine W

7:52 a.m.

Thanks Toby.

I understand that IPs are not an especially accurate way to look at unique visitors, but for the purposes of the Signpost's traffic report and the Top 25 I feel that they are reasonable approximations of ways to filter out what appear to be automated requests.

I am ok with holding those logs for 30 days, although I am a little surprised to hear that this is happening. However, what worries me a bit more is the idea that a staff member can be accessing those logs without that access being recorded. This might be something that you wish to investigate further.

I am not interested in getting this staff person into trouble. The information that they are providing is useful to the Signpost and certainly seems to be sanitized to a reasonable degree. However, it does concern me that they can access these logs without someone knowing about it. It seems to me that this sort of activity should be proactively disclosed to people in WMF who conduct legal and security reviews, and I hope you will consider what sort of security features are appropriate to make sure that occasions when anyone accesses the raw logs are recorded in a robust manner. I worry that if this one staffer can access logs without the higher-ups knowing about it, it is possible that someone who intends to do unethical activities with WMF's data could also access the logs without being noticed.

Thanks,

Pine

Jonathan Morgan

7:27 p.m.

Pine, have you considered asking Milowent who they work with on the IP data? I really, really doubt that there is some sort of shady back-alley data dealing going down here. - Jonathan

-- Jonathan T. Morgan Learning Strategist Wikimedia Foundation User:Jmorgan (WMF) https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF) jmorgan@wikimedia.org

Oliver Keyes

11:55 p.m.

It's me. Hi! I'm sort of confused by this.

In terms of shady back-alley data dealing, let me set out exactly what happens.

Every week, the Signpost emails me a list of articles that have unexpectedly high pageview counts and would be in the top 25, but nobody can quite work out why they're so popular. I go through the logs for the last week (I'd be unable to do this for any queries more than a month ago anyway, since we only keep the unsampled data for that long, but a week is what's relevant here), and pull out a tuple of {ip,referer,user agent,article,requests} for the articles on that list.

These tuples, which exist exclusively on our analytics machines (not even my personal, encrypted work laptop: they're only stored server-side, at all steps in this) are then hand-parsed by me. Can we pin all of the requests for [article], or at least most of them, on a single IP address, or a single {IP,user_agent} pair? Then it's probably a spammer or a spider or an [expletive]. No? Okay, if we sum by referer, do we see a common referer? If so, is that an actual referer or a fly-by-night live mirror? Questions like that.
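For readers who want to picture those checks, here is a rough Python sketch of the same idea. It is not the actual tooling (the analysis described above is done by hand); the `Row` type, the `classify` function, the 0.5 threshold, and the convention that an empty or "-" referer means "no referer" are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    """One {ip, referer, user agent, article, requests} tuple (hypothetical layout)."""
    ip: str
    referer: str
    user_agent: str
    article: str
    requests: int


def top_share(counts: Counter, total: int) -> float:
    """Fraction of all requests that fall into the single largest bucket."""
    if not counts or total == 0:
        return 0.0
    return counts.most_common(1)[0][1] / total


def classify(rows: Iterable[Row], threshold: float = 0.5) -> str:
    """Return a coarse verdict for one article.

    Only the verdict string is returned; no IPs or user agents leave
    this function. The threshold is an illustrative assumption.
    """
    by_ip, by_pair, by_referer = Counter(), Counter(), Counter()
    total = 0
    for r in rows:
        total += r.requests
        by_ip[r.ip] += r.requests
        by_pair[(r.ip, r.user_agent)] += r.requests
        # Assumption: "" or "-" means the request carried no referer.
        if r.referer not in ("", "-"):
            by_referer[r.referer] += r.requests

    if total == 0:
        return "no data"
    # Check the narrower {IP, user_agent} pair first, then the IP alone.
    if top_share(by_pair, total) >= threshold:
        return "probably automated (one IP and user agent)"
    if top_share(by_ip, total) >= threshold:
        return "probably automated (one IP)"
    if top_share(by_referer, total) >= threshold:
        return "common referer: check for a live mirror"
    return "looks legit"
```

Calling `classify` on the rows for a single article yields one of the verdict strings above. A human would still need to judge whether a common referer is a genuine source of traffic or a mirror.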

When I'm done with all of the articles, I email the Signpost with "for article1, that looks legit. Article2 is a web crawler I'm going to email and shout at. Article3 is a live mirror. Article4 looks legit. Article5...". These requests are logged on our Trello board, just like any other data request from any other party, community or staff. Milowent and the other Signposters get zero IPs, zero user agents, and nothing anywhere near that range of information: that stuff doesn't even leave the server. And when I'm done with it, I nuke it so it's not even *there*.

I hope that clarifies what's happening here. If you have specific questions about what we keep, that's obviously more of a question for management.

-- Oliver Keyes Research Analyst Wikimedia Foundation

Oliver Keyes

11:58 p.m.

I should also point out that "Toby not knowing who the staffer doing this one, highly specific, very minor piece of data-dogging is" does not equate to analytics not knowing who it is. I don't know what you do for a living but do you tend to give your boss's boss a constant play-by-play, or? ;p. It's documented in Trello just like everything else.

Dan Andreescu

18 Oct

1:16 a.m.

I see - Oliver's Batman. Nothing to see here, moving on.

On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes okeyes@wikimedia.org wrote:

...

I should also point out that "Toby not knowing who the staffer doing this one, highly specific, very minor piece of data-dogging is" does not equate to analytics not knowing who it is. I don't know what you do for a living but do you tend to give your boss's boss a constant play-by-play, or? ;p. It's documented in Trello just like everything else.

On 17 October 2014 16:55, Oliver Keyes okeyes@wikimedia.org wrote:

...
It's me. Hi! I'm sort of confused by this.

In terms of shady back-alley data dealing, let me set out exactly what happens.

Every week, the signpost emails me a list of articles that have unexpectedly high pageview counts and would be in the top 25, but nobody can quite work out why they're so popular. I go through the logs for the last week (I'd be unable to do this for any queries more than a month ago anyway, since we only keep the unsampled data for that long, but a week is what's relevant here), and pull out a tuple of {ip,referer,user agent,article, requests} for the articles on that list.

These tuples, which exist exclusively on our analytics machines (not even my personal, encrypted work laptop: they're only stored server-side, at all steps in this) are than hand-parsed by me. Can we pin all of the requests for [article], or at least most of them, on a single IP address, or a single {IP,user_agent} pair? Then it's probably a spammer or a spider or an [expletive]. No? Okay, if we sum by referer, do we see a common referer? If so, is that an actual referer or a fly-by-night live mirror? Questions like that.

When I'm done with all of the articles, I email the signpost with "for article1, that looks legit. Article2 is a web crawler I'm going to email and shout at. Article3 is a live mirror. Article4 looks legit. Article5...". These requests are logged on our trello board, just like any other data request from any other party, community or staff. Milowent and the other signposters get zero IPs, zero user agents, and nothing anywhere near that range of information: that stuff doesn't even leave the server. And when I'm done with it, I nuke it so it's not even *there*.

I hope that clarifies what's happening here. If you have specific questions about what we keep that's obviously more of a question for management.

On 17 October 2014 12:27, Jonathan Morgan jmorgan@wikimedia.org wrote:

Pine, have you considered asking Milowent who they work with on the IP data? I really, really doubt that there is some sort of shady back-alley data dealing going down here. - Jonathan

On Thu, Oct 16, 2014 at 9:52 PM, Pine W wiki.pine@gmail.com wrote:

Thanks Toby.

I understand that IPs are not an especially accurate way to look at unique visitors, but for the purposes of the Signpost's traffic report and the Top 25 I feel that they are reasonable approximations of ways to filter out what appear to be automated requests.

I am ok with holding those logs for 30 days, although I am a little surprised to hear that this is happening. However, what worries me a bit more is the idea that a staff member can be accessing those logs without that access being recorded. This might be something that you wish to investigate further.

I am not interested in getting this staff person into trouble. The information that they are providing is useful to the Signpost and certainly seems to be sanitized to a reasonable degree. However, it does concern me that they can access these logs without someone knowing about it. It seems to me that this sort of activity should be proactively disclosed to people in WMF who conduct legal and security reviews, and I hope you will consider what sort of security features are appropriate to make sure that occasions when anyone accesses the raw logs are recorded in a robust manner. I worry that if this one staffer can access logs without the higher-ups knowing about it, it is possible that someone who intends to do unethical activities with WMF's data could also access the logs without being noticed.

Thanks,

Pine

On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin tnegrin@wikimedia.org wrote:

Hi Pine --

Thanks for this -- it's a challenging topic but one that the Analytics team takes very seriously.

I'm not familiar with the IP address review that's referenced in the link. I don't know who the staffer might be. We don't currently calculate unique visitors to anything in Analytics and IP address is not a particularly accurate way to assess unique visitors regardless (due to proxies/NATs/etc).

We do store IPs as part of page requests in our raw logs which are deleted every 30 days. This data is kept on a system where access is limited and controlled by the operations team. We're in line with the privacy policy on this.

To be clear, we are currently considering mechanisms to count unique "requests" -- we rely on Comscore for this data and for several reasons, primarily related to mobile usage, it's not sufficient to understand our usage patterns. We are putting together some proposals to do this in as limited a way as possible and in a way that's respectful to our users. We'll share this with the community when we feel we understand the use cases and trade-offs well enough to discuss in an informed manner.

-Toby

We do store the IP address associated with varnish requests as part of the log. This data is


Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics


-- Jonathan T. Morgan Learning Strategist Wikimedia Foundation User:Jmorgan (WMF) https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF) jmorgan@wikimedia.org


-- Oliver Keyes Research Analyst Wikimedia Foundation


Toby Negrin

1:20 a.m.

Folks --

While I'm pleased that this validation was being done by a team member with full knowledge of our privacy and data retention policies, I think some good points have been raised that we're going to need to discuss as a team. I've reached out to legal for their assistance in figuring out the path forward.

-Toby

On Fri, Oct 17, 2014 at 3:16 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

I see - Oliver's Batman. Nothing to see here, moving on.


Pine W

19 Oct 2014

10:24 a.m.

Thanks very much, Toby and everyone.

Ironholds, I appreciate your doing traffic research on a volunteer basis for the benefit of the Signpost and the community. I'm concerned that the system as a whole may need a closer look, and I'm glad that Toby will be doing this with input from Legal.

Toby: I hope we can continue to get some Ironholds-sponsored filtering for the Traffic Report, although we may need to get it with some additional conditions attached.

Thanks and regards,

Pine


Oliver Keyes

20 Oct 2014

5:15 p.m.

Sorry, but no; what "additional conditions attached"? We're *not giving them any information* except for a boolean "this looks like illegitimate traffic, this one is legitimate or we can't tell" and a wild stab at what kind of illegitimate traffic it might be.

Please bear in mind that what you're essentially saying - or, how it's coming off - is that there is some shady, undocumented, privacy-policy-thorny thing going on here. That's a pretty big statement to make about the activities of a researcher. If you think you can substantiate it, tell me what conditions you might attach to the aforementioned information. Better yet, what information do you think is being transmitted? If you don't think you can substantiate it, don't say it.

Again, I'm sorry to be blunt. But to me this is kind of a big deal. If I've screwed up in some way I'd like you to stop talking in subtext and tell me how you think I have. Because at the moment I'm not entirely sure what I'm meant to be clarifying. But if I haven't, this sort of discussion can have a big impact on someone's reputation, and I'd like to clear it up.

On 19 October 2014 03:24, Pine W wiki.pine@gmail.com wrote:

...

Thanks very much, Toby and everyone.

Ironholds, I appreciate your doing traffic research on a volunteer basis for the benefit of the Signpost and the community. I'm concerned about the system as a whole may need a closer look, and I'm glad that Toby will be doing this with input from Legal.

Toby: I hope we can continue to get some Ironholds-sponsored filtering for the Traffic Report, although we may need to get it with some additional conditions attached.

Thanks and regards,

Pine

On Fri, Oct 17, 2014 at 3:20 PM, Toby Negrin tnegrin@wikimedia.org wrote:

...
Folks --

While I'm pleased that this validation was being done by a team member with full knowledge of our privacy and data retention policies, I think some good points have been raised that we're going to need to discuss as a team. I've reached out to legal for their assistance is figuring out the path forward.

-Toby

On Fri, Oct 17, 2014 at 3:16 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
I see - Oliver's batman. Nothing to see here, moving on.

On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes okeyes@wikimedia.org wrote:

...
I should also point out that "Toby not knowing who the staffer doing this one, highly specific, very minor piece of data-dogging is" does not equate to analytics not knowing who it is. I don't know what you do for a living but do you tend to give your boss's boss a constant play-by-play, or? ;p. It's documented in Trello just like everything else.

On 17 October 2014 16:55, Oliver Keyes okeyes@wikimedia.org wrote:

...
It's me. Hi! I'm sort of confused by this.

In terms of shady back-alley data dealing, let me set out exactly what happens.

Every week, the signpost emails me a list of articles that have unexpectedly high pageview counts and would be in the top 25, but nobody can quite work out why they're so popular. I go through the logs for the last week (I'd be unable to do this for any queries more than a month ago anyway, since we only keep the unsampled data for that long, but a week is what's relevant here), and pull out a tuple of {ip,referer,user agent,article, requests} for the articles on that list.

These tuples, which exist exclusively on our analytics machines (not even my personal, encrypted work laptop: they're only stored server-side, at all steps in this) are than hand-parsed by me. Can we pin all of the requests for [article], or at least most of them, on a single IP address, or a single {IP,user_agent} pair? Then it's probably a spammer or a spider or an [expletive]. No? Okay, if we sum by referer, do we see a common referer? If so, is that an actual referer or a fly-by-night live mirror? Questions like that.

When I'm done with all of the articles, I email the signpost with "for article1, that looks legit. Article2 is a web crawler I'm going to email and shout at. Article3 is a live mirror. Article4 looks legit. Article5...". These requests are logged on our trello board, just like any other data request from any other party, community or staff. Milowent and the other signposters get zero IPs, zero user agents, and nothing anywhere near that range of information: that stuff doesn't even leave the server. And when I'm done with it, I nuke it so it's not even *there*.

I hope that clarifies what's happening here. If you have specific questions about what we keep that's obviously more of a question for management.

On 17 October 2014 12:27, Jonathan Morgan jmorgan@wikimedia.org wrote:

...
Pine, have you considered asking Milowent who they work with on the IP data? I really, really doubt that there is some sort of shady back-alley data dealing going down here. - Jonathan

On Thu, Oct 16, 2014 at 9:52 PM, Pine W wiki.pine@gmail.com wrote:



-- Oliver Keyes Research Analyst Wikimedia Foundation

Jonathan Morgan

5:39 p.m.

On Mon, Oct 20, 2014 at 7:15 AM, Oliver Keyes okeyes@wikimedia.org wrote:

...

Sorry, but no; what "additional conditions attached"? We're *not giving them any information* except for a boolean "this looks like illegitimate traffic, this one is legitimate or we can't tell" and a wild stab at what kind of illegitimate traffic it might be.

Please bear in mind that what you're essentially saying - or, how it's coming off - is that there is some shady, undocumented, privacy-policy-thorny thing going on here. That's a pretty big statement to make about the activities of a researcher. If you think you can substantiate it: tell me what conditions you might attach to the aforementioned information? Better yet, what information do you think is being transmitted? If you don't think you can substantiate it, don't say it.

Again, I'm sorry to be blunt. But to me this is kind of a big deal. If I've screwed up in some way I'd like you to stop talking in subtext and tell me how you think I have. Because at the moment I'm not entirely sure what I'm meant to be clarifying. But if I haven't, this sort of discussion can have a big impact on someone's reputation, and I'd like to clear it up.

On 19 October 2014 03:24, Pine W wiki.pine@gmail.com wrote:

...
Thanks very much, Toby and everyone.

Ironholds, I appreciate your doing traffic research on a volunteer basis for the benefit of the Signpost and the community. I'm concerned that the system as a whole may need a closer look, and I'm glad that Toby will be doing this with input from Legal.

Toby: I hope we can continue to get some Ironholds-sponsored filtering for the Traffic Report, although we may need to get it with some additional conditions attached.

Thanks and regards,

Pine

On Fri, Oct 17, 2014 at 3:20 PM, Toby Negrin tnegrin@wikimedia.org wrote:

...
Folks --

While I'm pleased that this validation was being done by a team member with full knowledge of our privacy and data retention policies, I think some good points have been raised that we're going to need to discuss as a team. I've reached out to legal for their assistance in figuring out the path forward.

-Toby

On Fri, Oct 17, 2014 at 3:16 PM, Dan Andreescu dandreescu@wikimedia.org wrote:

...
I see - Oliver's batman. Nothing to see here, moving on.

On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes okeyes@wikimedia.org wrote:

...
I should also point out that "Toby not knowing who the staffer doing this one, highly specific, very minor piece of data-dogging is" does not equate to analytics not knowing who it is. I don't know what you do for a living but do you tend to give your boss's boss a constant play-by-play, or? ;p. It's documented in Trello just like everything else.

On 17 October 2014 16:55, Oliver Keyes okeyes@wikimedia.org wrote:

...
It's me. Hi! I'm sort of confused by this.

In terms of shady back-alley data dealing, let me set out exactly what happens.

Every week, the signpost emails me a list of articles that have unexpectedly high pageview counts and would be in the top 25, but nobody can quite work out why they're so popular. I go through the logs for the last week (I'd be unable to do this for any queries more than a month ago anyway, since we only keep the unsampled data for that long, but a week is what's relevant here), and pull out a tuple of {ip,referer,user agent,article, requests} for the articles on that list.

These tuples, which exist exclusively on our analytics machines (not even my personal, encrypted work laptop: they're only stored server-side, at all steps in this) are then hand-parsed by me. Can we pin all of the requests for [article], or at least most of them, on a single IP address, or a single {IP,user_agent} pair? Then it's probably a spammer or a spider or an [expletive]. No? Okay, if we sum by referer, do we see a common referer? If so, is that an actual referer or a fly-by-night live mirror? Questions like that.
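The triage described above can be sketched roughly as follows. This is only an illustration, not the script actually used: the record layout mirrors the {ip, referer, user agent, article, requests} tuple from the message, but the function names, the 0.8 threshold, and the sample data are all assumptions made for the example.

```python
from collections import Counter

# Each record mirrors the tuple described in the message:
# (ip, referer, user_agent, article, requests). All sample values are made up.


def dominant_share(records, key):
    """Share of all requests attributable to the single largest key value."""
    totals = Counter()
    for rec in records:
        totals[key(rec)] += rec[4]  # rec[4] is the request count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return 0.0
    return totals.most_common(1)[0][1] / grand_total


def classify(records, threshold=0.8):
    """Label one article's traffic. The threshold is an arbitrary assumption."""
    if dominant_share(records, key=lambda r: (r[0], r[2])) >= threshold:
        return "single IP and user agent: probably automated"
    if dominant_share(records, key=lambda r: r[0]) >= threshold:
        return "single IP: probably automated"
    if dominant_share(records, key=lambda r: r[1]) >= threshold:
        return "common referer: check whether it is a live mirror"
    return "looks legitimate, or cannot tell"


sample = [
    ("192.0.2.1", "-", "ExampleBot/1.0", "Example", 900),
    ("198.51.100.7", "https://www.example.org/", "Mozilla/5.0", "Example", 60),
    ("203.0.113.9", "-", "Mozilla/5.0", "Example", 40),
]
print(classify(sample))  # single IP and user agent: probably automated
```

Only the label would be passed on; the underlying IPs, user agents and referers stay in the working data. A common referer still needs a manual look, since the sketch cannot tell a genuine referring site from a mirror.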

When I'm done with all of the articles, I email the signpost with "for article1, that looks legit. Article2 is a web crawler I'm going to email and shout at. Article3 is a live mirror. Article4 looks legit. Article5...". These requests are logged on our trello board, just like any other data request from any other party, community or staff. Milowent and the other signposters get zero IPs, zero user agents, and nothing anywhere near that range of information: that stuff doesn't even leave the server. And when I'm done with it, I nuke it so it's not even *there*.

I hope that clarifies what's happening here. If you have specific questions about what we keep, that's obviously more of a question for management.

On 17 October 2014 12:27, Jonathan Morgan jmorgan@wikimedia.org wrote:

Pine, have you considered asking Milowent who they work with on the IP data? I really, really doubt that there is some sort of shady back-alley data dealing going down here.

- Jonathan

On Thu, Oct 16, 2014 at 9:52 PM, Pine W wiki.pine@gmail.com wrote:

...


-- Jonathan T. Morgan Learning Strategist Wikimedia Foundation User:Jmorgan (WMF) https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF) jmorgan@wikimedia.org

Pine W

8:14 p.m.

Sorry, let's see if I can rephrase.

The issue is not really about the Signpost (I didn't know which person was involved, and I wasn't planning to ask for a name; as I said up front, I am not interested in stirring up trouble). My questions were about how to reconcile that access with the recent discussions about instrumenting Wikipedia to track unique readers (which I was under the impression is still in the planning stages, so I was surprised to find out that this already happens), how to reconcile that access with the Privacy Policy, and how to make sure that in the general case of anyone accessing raw access logs that the access itself is logged. On the last point, to use an analogy, it's like someone accessing patient charts in a medical facility; there might be a good reason for a technician to view 100s of records of patients that he/she wasn't directly involved in treating, such as if the technician is helping to conduct a study of which doctors prescribe which treatments most often; on the other hand, if a technician is able to access those records at will and without that access being logged, then this creates worrisome potential for large-scale data harvesting for unauthorized uses without that access being noticed, including uses by someone whose account is compromised by a third party. If I was accessing the Wikipedia raw logs, I would expect that my access would be logged and monitored in the same way that I'm suggesting should be happening here, and it's not because I'm any more or less trustworthy than anyone else.

Does that make sense? I am less worried about the specific case of the Signpost and more worried about the general case of how the raw logs are accessed and making sure that there are good controls and logs for that access.

Thanks, Pine

Pine

*This is an Encyclopedia https://www.wikipedia.org/
One gateway to the wide garden of knowledge, where lies
The deep rock of our past, in which we must delve
The well of our future,
The clear water we must leave untainted for those who come after us,
The fertile earth, in which truth may grow in bright places, tended by many hands,
And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.*

*—Catherine Munro*

On Mon, Oct 20, 2014 at 7:15 AM, Oliver Keyes okeyes@wikimedia.org wrote:

...


Oliver Keyes

8:36 p.m.

On 20 October 2014 13:14, Pine W wiki.pine@gmail.com wrote:

...

> Sorry, let's see if I can rephrase.
>
> The issue is not really about the Signpost (I didn't know which person was involved, and I wasn't planning to ask for a name; as I said up front, I am not interested in stirring up trouble). My questions were about how to reconcile that access with the recent discussions about instrumenting Wikipedia to track unique readers (which I was under the impression is still in the planning stages, so I was surprised to find out that this already happens),

No, it doesn't. It is still in the planning stages. The reason it doesn't currently happen (well, other than that we have historically not had any need for it) is that fingerprinting based on user agent and IP address is fundamentally unreliable. You cannot grab all requests from one person or one client on a consistent basis using that.

However, if you have three million requests in five hours from one IP address, using the same user agent each time, we can /probably say/ that they're the same person, and actually not a person at all, based on the information available.

This does not happen on a regular or consistent basis; it does not happen even within the Signpost requests most of the time. I don't care what the {ip,user_agent,referer} tuple outputs, merely how many combinations it outputs. It's simply a very basic heuristic for "how likely is it that this is natural traffic". Natural traffic has a wide range of tuples. Unnatural traffic does not.
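A rough sketch of that heuristic: count how many distinct {ip, user_agent, referer} combinations appear relative to the number of requests. The function name, the sample data, and the choice to express the result as a ratio are assumptions for illustration; the message only says that the number of combinations matters, not how it was computed.

```python
def tuple_diversity(requests):
    """Return (distinct_combinations, total_requests, ratio).

    `requests` is an iterable of (ip, user_agent, referer) tuples, one per
    request. Only the number of distinct combinations is reported, never
    the values themselves.
    """
    total = 0
    seen = set()
    for combo in requests:
        total += 1
        seen.add(combo)
    ratio = len(seen) / total if total else 0.0
    return len(seen), total, ratio


# Made-up data: one client hammering a page versus many different readers.
crawler = [("192.0.2.1", "ExampleBot/1.0", "-")] * 1000
readers = [(f"198.51.100.{i}", "Mozilla/5.0", "-") for i in range(200)]

print(tuple_diversity(crawler))  # (1, 1000, 0.001)
print(tuple_diversity(readers))  # (200, 200, 1.0)
```

A very low ratio suggests a narrow range of tuples, which the message treats as a sign of unnatural traffic. Where to draw the line is a judgment call that this sketch does not attempt to make.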

Outside of this situation (obvious spiders) the only time we have done it (at least, when I've been on the R&D team, so since January) was for a session analysis project that is fully documented on meta (https://meta.wikimedia.org/wiki/Research:Mobile_sessions) and has a transparent codebase (https://github.com/Ironholds/MobileSessions). As part of that study I conducted a series of entropy tests that ascertained that this kind of fingerprinting was completely useless for generic user habits.

...

> how to reconcile that access with the Privacy Policy

What do you mean by this? What element of it conflicts with the privacy policy, or in your mind, might conflict with the privacy policy? Again, I'd like explicit statements of "I perceive that X happened", not "I have questions about this broad domain", if possible.

...

> and how to make sure that in the general case of anyone accessing raw access logs that the access itself is logged. On the last point, to use an analogy, it's like someone accessing patient charts in a medical facility; there might be a good reason for a technician to view 100s of records of patients that he/she wasn't directly involved in treating, such as if the technician is helping to conduct a study of which doctors prescribe which treatments most often; on the other hand, if a technician is able to access those records at will and without that access being logged, then this creates worrisome potential for large-scale data harvesting for unauthorized uses without that access being noticed, including uses by someone whose account is compromised by a third party.

Certainly. However, for someone to do this, they would have to compromise my private SSH key - a key that lives in two machines, both of which have full disc encryption, one to Xubuntu standards and one under cascading AES-Twofish-Serpent. They would have to do this without me noticing and immediately reporting it to Operations and having my keys revoked. They would then have to navigate to the one specific cluster of machines able to access these logs (which live behind a bastion), and, because of the dataset size, run a query lasting several hours or days to get anything useful.

Once they had this useful thing, and assuming they could download it to their machine, they would be confronted with the problem that every query run against this system is already internally logged and stored, along with the username of the person who triggered it. This service runs separately from our Trello instance, which logs the actual rationale for particular research projects in a transparent, community-and-staffer-accessible way.
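To make the idea of per-query audit logging concrete, here is a minimal sketch of a wrapper that records who ran which query before executing it. It does not describe the actual logging on the analytics cluster, whose implementation is not given in this thread; the `audited_query` name, the log format, and the in-memory SQLite database are assumptions chosen so that the example is self-contained.

```python
import getpass
import logging
import sqlite3

# Hypothetical audit logger; a real deployment would ship these records
# somewhere the person running the query cannot edit.
audit_log = logging.getLogger("query_audit")


def audited_query(conn, sql, params=(), user=None):
    """Write an audit record (user and query text), then run the query."""
    user = user or getpass.getuser()
    # Log before executing, so a failed or interrupted query still leaves a trace.
    audit_log.info("user=%s query=%s", user, sql)
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (article TEXT, hits INTEGER)")
    conn.execute("INSERT INTO requests VALUES ('Example', 3)")
    rows = audited_query(
        conn, "SELECT article, hits FROM requests", user="example_analyst"
    )
    print(rows)
```

The point of the design is that the audit record is a side effect of the only supported way to run a query, rather than something the analyst has to remember to write down.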

So: yes, someone external could do this, although it would be fairly hard and, frankly, once you assume someone can compromise SSH keys without anyone noticing, all of our infrastructure is screwed. And I could do this, internally. But were either of these situations to occur, we already have automated logging of the actual queries and a social convention around logging active research projects, be they related to pageviews or not, in a way that allows for staff and community observation and review.

I guess mostly I'm just confused as to what you'd add on top of "SSH keys, automated logging and transparent documentation".

...

If I was accessing the Wikipedia raw logs, I would expect that my access would be logged and monitored in the same way that I'm suggesting should be happening here, and it's not because I'm any more or less trustworthy than anyone else.

Does that make sense? I am less worried about the specific case of the Signpost and more worried about the general case of how the raw logs are accessed and making sure that there are good controls and logs for that access.

Thanks,

...

Pine

*This is an Encyclopedia* https://www.wikipedia.org/

*One gateway to the wide garden of knowledge, where lies The deep rock of our past, in which we must delve The well of our future, The clear water we must leave untainted for those who come after us, The fertile earth, in which truth may grow in bright places, tended by many hands, And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.*

*—Catherine Munro*

On Mon, Oct 20, 2014 at 7:15 AM, Oliver Keyes okeyes@wikimedia.org wrote:

...
Sorry, but no; what "additional conditions attached"? We're *not giving them any information* except for a boolean "this looks like illegitimate traffic, this one is legitimate or we can't tell" and a wild stab at what kind of illegitimate traffic it might be.

Please bear in mind that what you're essentially saying - or, how it's coming off - is that there is some shady, undocumented, privacy-policy-thorny thing going on here. That's a pretty big statement to make about the activities of a researcher. If you think you can substantiate it: tell me what conditions you might attach to the aforementioned information? Better yet, what information do you think is being transmitted? If you don't think you can substantiate it, don't say it.

Again, I'm sorry to be blunt. But to me this is kind of a big deal. If I've screwed up in some way I'd like you to stop talking in subtext and tell me how you think I have. Because at the moment I'm not entirely sure what I'm meant to be clarifying. But if I haven't, this sort of discussion can have a big impact on someone's reputation, and I'd like to clear it up.

On 19 October 2014 03:24, Pine W wiki.pine@gmail.com wrote:

...
Thanks very much, Toby and everyone.

Ironholds, I appreciate your doing traffic research on a volunteer basis for the benefit of the Signpost and the community. I'm concerned that the system as a whole may need a closer look, and I'm glad that Toby will be doing this with input from Legal.

Toby: I hope we can continue to get some Ironholds-sponsored filtering for the Traffic Report, although we may need to get it with some additional conditions attached.

Thanks and regards,

Pine

On Fri, Oct 17, 2014 at 3:20 PM, Toby Negrin tnegrin@wikimedia.org wrote:

...
Folks --

While I'm pleased that this validation was being done by a team member with full knowledge of our privacy and data retention policies, I think some good points have been raised that we're going to need to discuss as a team. I've reached out to legal for their assistance in figuring out the path forward.

-Toby

On Fri, Oct 17, 2014 at 3:16 PM, Dan Andreescu < dandreescu@wikimedia.org> wrote:

...
I see - Oliver's batman. Nothing to see here, moving on.

On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes okeyes@wikimedia.org wrote:

...
I should also point out that "Toby not knowing who the staffer doing this one, highly specific, very minor piece of data-dogging is" does not equate to analytics not knowing who it is. I don't know what you do for a living but do you tend to give your boss's boss a constant play-by-play, or? ;p. It's documented in Trello just like everything else.

On 17 October 2014 16:55, Oliver Keyes okeyes@wikimedia.org wrote:

> It's me. Hi! I'm sort of confused by this.
>
> In terms of shady back-alley data dealing, let me set out exactly what happens.
>
> Every week, the signpost emails me a list of articles that have unexpectedly high pageview counts and would be in the top 25, but nobody can quite work out why they're so popular. I go through the logs for the last week (I'd be unable to do this for any queries more than a month ago anyway, since we only keep the unsampled data for that long, but a week is what's relevant here), and pull out a tuple of {ip,referer,user agent,article, requests} for the articles on that list.
>
> These tuples, which exist exclusively on our analytics machines (not even my personal, encrypted work laptop: they're only stored server-side, at all steps in this) are then hand-parsed by me. Can we pin all of the requests for [article], or at least most of them, on a single IP address, or a single {IP,user_agent} pair? Then it's probably a spammer or a spider or an [expletive]. No? Okay, if we sum by referer, do we see a common referer? If so, is that an actual referer or a fly-by-night live mirror? Questions like that.
>
> When I'm done with all of the articles, I email the signpost with "for article1, that looks legit. Article2 is a web crawler I'm going to email and shout at. Article3 is a live mirror. Article4 looks legit. Article5...". These requests are logged on our trello board, just like any other data request from any other party, community or staff. Milowent and the other signposters get zero IPs, zero user agents, and nothing anywhere near that range of information: that stuff doesn't even leave the server. And when I'm done with it, I nuke it so it's not even *there*.
>
> I hope that clarifies what's happening here. If you have specific questions about what we keep that's obviously more of a question for management.
>
> On 17 October 2014 12:27, Jonathan Morgan jmorgan@wikimedia.org wrote:
>
>> Pine, have you considered asking Milowent who they work with on the IP data? I really, really doubt that there is some sort of shady back-alley data dealing going down here. - Jonathan
>>
>> On Thu, Oct 16, 2014 at 9:52 PM, Pine W wiki.pine@gmail.com wrote:
>>
>>> Thanks Toby.
>>>
>>> I understand that IPs are not an especially accurate way to look at unique visitors, but for the purposes of the Signpost's traffic report and the Top 25 I feel that they are reasonable approximations of ways to filter out what appear to be automated requests.
>>>
>>> I am ok with holding those logs for 30 days, although I am a little surprised to hear that this is happening. However, what worries me a bit more is the idea that a staff member can be accessing those logs without that access being recorded. This might be something that you wish to investigate further.
>>>
>>> I am not interested in getting this staff person into trouble. The information that they are providing is useful to the Signpost and certainly seems to be sanitized to a reasonable degree. However, it does concern me that they can access these logs without someone knowing about it, it seems to me that this sort of activity should be proactively disclosed to people in WMF who conduct legal and security reviews, and I hope you will consider what sort of security features are appropriate to make sure that occasions when anyone accesses the raw logs are recorded in a robust manner. I worry that if this one staffer can access logs without the higher-ups knowing about it, it is possible that someone who intends to do unethical activities with WMF's data could also access the logs without being noticed.
>>>
>>> Thanks,
>>>
>>> Pine
>>>
>>> On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin tnegrin@wikimedia.org wrote:
>>>
>>>> Hi Pine --
>>>>
>>>> Thanks for this -- it's a challenging topic but one that the Analytics team takes very seriously.
>>>>
>>>> I'm not familiar with the IP address review that's referenced in the link. I don't know who the staffer might be. We don't currently calculate unique visitors to anything in Analytics and IP address is not a particularly accurate way to assess unique visitors regardless (due to proxies/NATs/etc).
>>>>
>>>> We do store IPs as part of page requests in our raw logs which are deleted every 30 days. This data is kept on a system where access is limited and controlled by the operations team. We're in line with the privacy policy on this.
>>>>
>>>> To be clear, we are currently considering mechanisms to count unique "requests" -- we rely on Comscore for this data and for several reasons, primarily related to mobile usage, it's not sufficient to understand our usage patterns. We are putting together some proposals to do this in as limited way as possible and that's respectful to our users. We'll share this with the community when we feel we understand the use cases and trade-offs well enough to discuss in an informed manner.
>>>>
>>>> -Toby
>>>>
>>>> We do store the IP address associated with varnish requests as part of the log. This data is
>>>>
>>>> On Thu, Oct 16, 2014 at 8:50 PM, Pine W wiki.pine@gmail.com wrote:
>>>>
>>>>> Hi again Analytics,
>>>>>
>>>>> I was under the impression that no records are kept of which IPs access which articles on Wikipedia when no edits are made, but it appears that such records are in fact kept [1].
>>>>>
>>>>> Is this proper? This practice appears to be permissible under the Privacy Policy which states that "We use IP addresses for research and analytics; to better personalize content, notices, and settings for you; to fight spam, identity theft, malware, and other kinds of abuse; and to provide better mobile and other applications."
>>>>>
>>>>> It is possible that this information is relevant for determining the number of unique visitors that Wikipedia gets and that this information is always properly filtered before it gets to the Signpost. However, given recent discussions which I thought said that Wikipedia was not instrumented to track unique visitors, I am surprised to learn that this already seems to be happening and that the situation has been this way for some time, so I would appreciate clarification.
>>>>>
>>>>> I want to emphasize that this question is about clarifying the practice of tracking likely unique visitors by IP. This question is not intended to start flame wars, get people into trouble, or limit the Signpost's access to properly filtered information if there has been a determination that WMF's retention of the raw data is appropriate. There might be appropriate secondary questions about making sure that access to the raw IP access data is carefully contained and secured.
>>>>>
>>>>> Thank you very much,
>>>>>
>>>>> Pine
>>>>>
>>>>> [1] https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&dif...
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> Analytics@lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>> --
>> Jonathan T. Morgan
>> Learning Strategist
>> Wikimedia Foundation
>> User:Jmorgan (WMF)
>> https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)
>> jmorgan@wikimedia.org
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation

-- Oliver Keyes Research Analyst Wikimedia Foundation

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

-- Oliver Keyes Research Analyst Wikimedia Foundation

Jeremy Baron

8:50 p.m.

On Oct 20, 2014 1:36 PM, "Oliver Keyes" okeyes@wikimedia.org wrote:

...

I guess mostly I'm just confused as to what you'd add on top of "SSH keys, automated logging and transparent documentation".

I *think* Pine was asking for automatic query logging similar to what you've just said is already happening.

Eventually maybe we'll get these types of queries mostly running on hadoop+M/R. (vs. processing a local file on disk) We could publish public logs of M/R jobs and for some of them allow public download of the output. (but this particular query would not allow public downloading of the output because IP/UA string/etc.)
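As a rough illustration of that idea, here is a minimal sketch of a public job-log entry that records who ran a job and why, and marks the output as downloadable only when it contains no sensitive fields. Everything in it is an assumption made for illustration: the `JobRecord` type, the field names, and the `SENSITIVE_FIELDS` set are hypothetical and do not describe any existing Wikimedia system.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# Hypothetical set of output fields that must never be publicly downloadable.
SENSITIVE_FIELDS = frozenset({"ip", "user_agent"})


@dataclass(frozen=True)
class JobRecord:
    """One logged job; all field names are assumptions for this sketch."""
    username: str
    description: str
    output_fields: Tuple[str, ...]


def output_is_publishable(job: JobRecord) -> bool:
    """True only if the job's output contains none of the sensitive fields."""
    return SENSITIVE_FIELDS.isdisjoint(job.output_fields)


def public_log_entry(job: JobRecord) -> Dict[str, Union[str, bool]]:
    """Build the entry that would appear in a public job log.

    The job itself is always listed; only the download flag varies.
    """
    return {
        "username": job.username,
        "description": job.description,
        "output_downloadable": output_is_publishable(job),
    }
```

Under these assumptions, a job whose output includes `ip` or `user_agent` would still be listed in the public log, but its output would stay server-side.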

-Jeremy

Oliver Keyes

8:53 p.m.

Makes sense. Yeah, I had an "assuming everyone knows what you know" moment; I appreciate the automated query logging may not be a known thing (for the reasons Jeremy sets out, it's currently accessible only via an internal proxy, which makes it a wee bit difficult for people to know that it exists ;p). Sorry about that.

We could probably do it via Hadoop (it'd be a lot easier to automate!) if we come up with some useful heuristics for what automated activity looks like. I'm hoping that the spider/bot/automation identification as part of the pageviews definition will give us some of that.
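To make the kind of heuristic concrete, here is a minimal sketch of the manual checks described earlier in this thread: do most requests for an article come from a single {IP, user agent} pair, from a single IP, or via a common referer? It is plain Python rather than a Hadoop job, and the `Row` type, the 0.5 threshold, the treatment of `""` and `"-"` as "no referer", and the result labels are all assumptions for illustration, not the actual procedure.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, List, NamedTuple, Optional


class Row(NamedTuple):
    """One aggregated log row, mirroring the tuple described in the thread."""
    ip: str
    referer: str
    user_agent: str
    article: str
    requests: int


def top_share(rows: List[Row], key: Callable[[Row], Optional[Hashable]]) -> float:
    """Share of all requests attributed to the most common key value.

    Rows whose key is None are skipped as candidates but still count
    toward the overall total.
    """
    totals: Counter = Counter()
    for row in rows:
        k = key(row)
        if k is not None:
            totals[k] += row.requests
    grand_total = sum(row.requests for row in rows)
    if not totals or grand_total == 0:
        return 0.0
    return totals.most_common(1)[0][1] / grand_total


def triage(rows: Iterable[Row], threshold: float = 0.5) -> str:
    """Classify one article's traffic; the threshold is an assumed cut-off."""
    rows = list(rows)
    # Most specific check first: one client (same IP and user agent).
    if top_share(rows, lambda r: (r.ip, r.user_agent)) >= threshold:
        return "single_client"
    # One IP sending requests under several user agents.
    if top_share(rows, lambda r: r.ip) >= threshold:
        return "single_ip"
    # A shared referer; a human still has to decide whether it is a
    # genuine referrer or a live mirror.
    if top_share(rows, lambda r: r.referer if r.referer not in ("", "-") else None) >= threshold:
        return "common_referer"
    return "no_obvious_pattern"
```

A result of `"common_referer"` only flags the article for a closer look; it does not by itself say whether the traffic is legitimate.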

On 20 October 2014 13:50, Jeremy Baron jeremy@tuxmachine.com wrote:

...

On Oct 20, 2014 1:36 PM, "Oliver Keyes" okeyes@wikimedia.org wrote:

...
I guess mostly I'm just confused as to what you'd add on top of "SSH keys, automated logging and transparent documentation".

I *think* Pine was asking for automatic query logging similar to what you've just said is already happening.

Eventually maybe we'll get these types of queries mostly running on hadoop+M/R. (vs. processing a local file on disk) We could publish public logs of M/R jobs and for some of them allow public download of the output. (but this particular query would not allow public downloading of the output because IP/UA string/etc.)

-Jeremy

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

-- Oliver Keyes Research Analyst Wikimedia Foundation

Pine W

9:17 p.m.

I think we are now all getting on the same wavelength.

The one piece of this puzzle that I am still missing is why this traffic research for the Signpost seems to have been a surprise to Toby, and why he thought it would benefit from Legal's input. If the queries were being logged, I would have thought Toby would be aware of them because he would see them in the logs, and I would think that he and others would be regularly checking the logs to make sure that all accesses look normal. Toby, can you comment on that, and also clarify which part of this you think will benefit from Legal's input?

Thanks,

Pine

*This is an Encyclopedia* https://www.wikipedia.org/

*One gateway to the wide garden of knowledge, where lies The deep rock of our past, in which we must delve The well of our future, The clear water we must leave untainted for those who come after us, The fertile earth, in which truth may grow in bright places, tended by many hands, And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.*

*—Catherine Munro*

On Mon, Oct 20, 2014 at 10:53 AM, Oliver Keyes okeyes@wikimedia.org wrote:

...

Makes sense. Yeah, I had an "assuming everyone knows what you know" moment; I appreciate the automated query logging may not be a known thing (for the reasons Jeremy sets out, it's currently accessible only via an internal proxy, which makes it a wee bit difficult for people to know that it exists ;p). Sorry about that.

We could probably do it via Hadoop (it'd be a lot easier to automate!) if we come up with some useful heuristics for what automated activity looks like. I'm hoping that the spider/bot/automation identification as part of the pageviews definition will give us some of that.

On 20 October 2014 13:50, Jeremy Baron jeremy@tuxmachine.com wrote:

...
On Oct 20, 2014 1:36 PM, "Oliver Keyes" okeyes@wikimedia.org wrote:

...
I guess mostly I'm just confused as to what you'd add on top of "SSH keys, automated logging and transparent documentation".

I *think* Pine was asking for automatic query logging similar to what you've just said is already happening.

Eventually maybe we'll get these types of queries mostly running on hadoop+M/R. (vs. processing a local file on disk) We could publish public logs of M/R jobs and for some of them allow public download of the output. (but this particular query would not allow public downloading of the output because IP/UA string/etc.)

-Jeremy

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

-- Oliver Keyes Research Analyst Wikimedia Foundation

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

analytics@lists.wikimedia.org

participants (6)

Dan Andreescu
Jeremy Baron
Jonathan Morgan
Oliver Keyes
Pine W
Toby Negrin