Re: [Analytics] Records of article access

17 Oct 2014

I should also point out that "Toby not knowing who the staffer doing this
one, highly specific, very minor piece of data-dogging is" does not equate
to analytics not knowing who it is. I don't know what you do for a living
but do you tend to give your boss's boss a constant play-by-play, or? ;p.
It's documented in Trello just like everything else.

On 17 October 2014 16:55, Oliver Keyes <okeyes(a)wikimedia.org> wrote:

...
  It's me. Hi! I'm sort of confused by this.

 In terms of shady back-alley data dealing, let me set out exactly what
 happens.

 Every week, the signpost emails me a list of articles that have
 unexpectedly high pageview counts and would be in the top 25, but nobody
 can quite work out why they're so popular. I go through the logs for the
 last week (I'd be unable to do this for any queries more than a month ago
 anyway, since we only keep the unsampled data for that long, but a week is
 what's relevant here), and pull out a tuple of {ip,referer,user
 agent,article, requests} for the articles on that list.

 These tuples, which exist exclusively on our analytics machines (not even
 my personal, encrypted work laptop: they're only stored server-side, at all
 steps in this) are then hand-parsed by me. Can we pin all of the requests
 for [article], or at least most of them, on a single IP address, or a
 single {IP,user_agent} pair? Then it's probably a spammer or a spider or an
 [expletive]. No? Okay, if we sum by referer, do we see a common referer? If
 so, is that an actual referer or a fly-by-night live mirror? Questions like
 that.
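
 The checks described above can be sketched roughly as follows. This is
 only an illustration, not the actual analysis code: the record layout,
 the function name `dominant_source`, and the 50% threshold are all
 assumptions made for the example, and deciding whether a common referer
 is a real site or a live mirror remains a manual judgement.

```python
from collections import Counter

def dominant_source(records, threshold=0.5):
    """Report whether one source accounts for most requests to an article.

    `records` is a list of dicts with the (hypothetical) keys
    ip, referer, user_agent and requests. `threshold` is an assumed
    cut-off for "most of them"; the thread does not give a number.
    """
    records = list(records)
    total = sum(r["requests"] for r in records)
    if total == 0:
        return ("no traffic", None)

    by_pair, by_ip, by_referer = Counter(), Counter(), Counter()
    for r in records:
        by_pair[(r["ip"], r["user_agent"])] += r["requests"]
        by_ip[r["ip"]] += r["requests"]
        if r["referer"]:  # ignore requests that carry no referer
            by_referer[r["referer"]] += r["requests"]

    # Most specific explanation first: a dominant {IP, user agent} pair
    # always implies a dominant IP, so the pair is checked before the IP.
    checks = (
        ("single IP/user-agent pair", by_pair),
        ("single IP", by_ip),
        ("common referer", by_referer),
    )
    for label, counts in checks:
        top = counts.most_common(1)
        if top and top[0][1] / total >= threshold:
            return (label, top[0][0])
    return ("looks legit", None)
```

 Only the label would go into the reply to the Signpost; the key that is
 returned alongside it (an IP, a pair, or a referer) stays on the server.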

 When I'm done with all of the articles, I email the signpost with "for
 article1, that looks legit. Article2 is a web crawler I'm going to email
 and shout at. Article3 is a live mirror. Article4 looks legit.
 Article5...". These requests are logged on our trello board, just like any
 other data request from any other party, community or staff. Milowent and
 the other signposters get zero IPs, zero user agents, and nothing anywhere
 near that range of information: that stuff doesn't even leave the server.
 And when I'm done with it, I nuke it so it's not even *there*.

 I hope that clarifies what's happening here. If you have specific
 questions about what we keep that's obviously more of a question for
 management.

 On 17 October 2014 12:27, Jonathan Morgan <jmorgan(a)wikimedia.org> wrote:

  Pine, have you considered asking Milowent who they work with on the IP
 data? I really, really doubt that there is some sort of shady back-alley
 data dealing going down here. - Jonathan

 On Thu, Oct 16, 2014 at 9:52 PM, Pine W <wiki.pine(a)gmail.com> wrote:

  Thanks Toby.

 I understand that IPs are not an especially accurate way to look at
 unique visitors, but for the purposes of the Signpost's traffic report and
 the Top 25 I feel that they are reasonable approximations of ways to filter
 out what appear to be automated requests.

 I am ok with holding those logs for 30 days, although I am a little
 surprised to hear that this is happening. However, what worries me a bit
 more is the idea that a staff member can be accessing those logs without
 that access being recorded. This might be something that you wish to
 investigate further.

 I am not interested in getting this staff person into trouble. The
 information that they are providing is useful to the Signpost and certainly
 seems to be sanitized to a reasonable degree. However, it does concern me
 that they can access these logs without someone knowing about it. It seems
 to me that this sort of activity should be proactively disclosed to people
 in WMF who conduct legal and security reviews, and I hope you will consider
 what sort of security features are appropriate to make sure that occasions
 when anyone accesses the raw logs are recorded in a robust manner. I worry
 that if this one staffer can access logs without the higher-ups knowing
 about it, it is possible that someone who intends to do unethical
 activities with WMF's data could also access the logs without being noticed.

 Thanks,

 Pine

 On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin <tnegrin(a)wikimedia.org>
 wrote:

  Hi Pine --

 Thanks for this -- it's a challenging topic but one that the Analytics
 team takes very seriously.

 I'm not familiar with the IP address review that's referenced in the
 link. I don't know who the staffer might be. We don't currently calculate
 unique visitors to anything in Analytics and IP address is not a
 particularly accurate way to assess unique visitors regardless (due to
 proxies/NATs/etc).

 We do store IPs as part of page requests in our raw logs which are
 deleted every 30 days. This data is kept on a system where access is
 limited and controlled by the operations team. We're in line with the
 privacy policy on this.

 To be clear, we are currently considering mechanisms to count unique
 "requests" -- we rely on Comscore for this data and for several reasons,
 primarily related to mobile usage, it's not sufficient to understand our
 usage patterns. We are putting together some proposals to do this in as
 limited a way as possible and that's respectful to our users. We'll share
 this with the community when we feel we understand the use cases and
 trade-offs well enough to discuss in an informed manner.

 -Toby

 We do store the IP address associated with varnish requests as part of
 the log. This data is

 On Thu, Oct 16, 2014 at 8:50 PM, Pine W <wiki.pine(a)gmail.com> wrote:

> Hi again Analytics,
>
> I was under the impression that no records are kept of which IPs
> access which articles on Wikipedia when no edits are made, but it appears
> that such records are in fact kept [1].
>
> Is this proper? This practice appears to be permissible under the
> Privacy Policy which states that "We use IP addresses for research and
> analytics; to better personalize content, notices, and settings for you; to
> fight spam, identity theft, malware, and other kinds of abuse; and to
> provide better mobile and other applications."
>
> It is possible that this information is relevant for determining the
> number of unique visitors that Wikipedia gets and that this information is
> always properly filtered before it gets to the Signpost. However, given
> recent discussions which I thought said that Wikipedia was not instrumented
> to track unique visitors, I am surprised to learn that this already seems
> to be happening and that the situation has been this way for some time, so
> I would appreciate clarification.
>
> I want to emphasize that this question is about clarifying the
> practice of tracking likely unique visitors by IP. This question is not
> intended to start flame wars, get people into trouble, or limit the
> Signpost's access to properly filtered information if there has been a
> determination that WMF's retention of the raw data is appropriate. There
> might be appropriate secondary questions about making sure that access to
> the raw IP access data is carefully contained and secured.
>
> Thank you very much,
>
> Pine
>
> [1]
>
https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&di…
>
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>

 --
 Jonathan T. Morgan
 Learning Strategist
 Wikimedia Foundation
 User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
 jmorgan(a)wikimedia.org

 --
 Oliver Keyes
 Research Analyst
 Wikimedia Foundation

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
