I see - Oliver's batman. Nothing to see
here, moving on.
On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes <okeyes(a)wikimedia.org>
wrote:
I should also point out that "Toby not knowing who the staffer doing
this one, highly specific, very minor piece of data-dogging is" does not
equate to analytics not knowing who it is. I don't know what you do for a
living but do you tend to give your boss's boss a constant play-by-play,
or? ;p. It's documented in Trello just like everything else.
On 17 October 2014 16:55, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> It's me. Hi! I'm sort of confused by this.
>
> In terms of shady back-alley data dealing, let me set out exactly what
> happens.
>
> Every week, the signpost emails me a list of articles that have
> unexpectedly high pageview counts and would be in the top 25, but nobody
> can quite work out why they're so popular. I go through the logs for the
> last week (I'd be unable to do this for any queries more than a month ago
> anyway, since we only keep the unsampled data for that long, but a week is
> what's relevant here), and pull out a tuple of {ip,referer,user
> agent,article, requests} for the articles on that list.
>
> These tuples, which exist exclusively on our analytics machines (not
> even my personal, encrypted work laptop: they're only stored server-side,
> at all steps in this) are then hand-parsed by me. Can we pin all of the
> requests for [article], or at least most of them, on a single IP address,
> or a single {IP,user_agent} pair? Then it's probably a spammer or a spider
> or an [expletive]. No? Okay, if we sum by referer, do we see a common
> referer? If so, is that an actual referer or a fly-by-night live mirror?
> Questions like that.
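
For anyone who would rather read that triage as code, here is a minimal sketch in Python. It is not the actual script (the thread doesn't show one). The function name `triage`, the 0.5 cutoff standing in for "most of them", the verdict strings, and the treatment of an empty or "-" referer as "no referer" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Assumption: "most of them" is read as "more than half of the requests".
DOMINANCE = 0.5

def triage(rows, dominance=DOMINANCE):
    """Classify each article from (ip, referer, user_agent, article, requests) tuples."""
    by_ip = defaultdict(Counter)
    by_pair = defaultdict(Counter)
    by_referer = defaultdict(Counter)
    totals = Counter()

    for ip, referer, user_agent, article, requests in rows:
        totals[article] += requests
        by_ip[article][ip] += requests
        by_pair[article][(ip, user_agent)] += requests
        # Assumption: "" or "-" means the request carried no referer.
        if referer not in ("", "-"):
            by_referer[article][referer] += requests

    def top_share(counter, total):
        # Share of all requests held by the single largest key (0.0 if none).
        if not counter or total == 0:
            return 0.0
        return counter.most_common(1)[0][1] / total

    verdicts = {}
    for article, total in totals.items():
        if top_share(by_pair[article], total) > dominance:
            verdicts[article] = "single {IP,user_agent} pair"
        elif top_share(by_ip[article], total) > dominance:
            verdicts[article] = "single IP"
        elif top_share(by_referer[article], total) > dominance:
            verdicts[article] = "common referer"
        else:
            verdicts[article] = "looks legit"
    return verdicts
```

A "common referer" verdict still needs a human to decide whether it is a real referer or a live mirror, and only the per-article verdicts would ever be shared, never the rows themselves.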
>
> When I'm done with all of the articles, I email the signpost with "for
> article1, that looks legit. Article2 is a web crawler I'm going to email
> and shout at. Article3 is a live mirror. Article4 looks legit.
> Article5...". These requests are logged on our trello board, just like any
> other data request from any other party, community or staff. Milowent and
> the other signposters get zero IPs, zero user agents, and nothing anywhere
> near that range of information: that stuff doesn't even leave the server.
> And when I'm done with it, I nuke it so it's not even *there*.
>
> I hope that clarifies what's happening here. If you have specific
> questions about what we keep that's obviously more of a question for
> management.
>
> On 17 October 2014 12:27, Jonathan Morgan <jmorgan(a)wikimedia.org>
> wrote:
>
>> Pine, have you considered asking Milowent who they work with on the
>> IP data? I really, really doubt that there is some sort of shady back-alley
>> data dealing going down here. - Jonathan
>>
>> On Thu, Oct 16, 2014 at 9:52 PM, Pine W <wiki.pine(a)gmail.com> wrote:
>>
>>> Thanks Toby.
>>>
>>> I understand that IPs are not an especially accurate way to look at
>>> unique visitors, but for the purposes of the Signpost's traffic report and
>>> the Top 25 I feel that they are reasonable approximations of ways to filter
>>> out what appear to be automated requests.
>>>
>>> I am ok with holding those logs for 30 days, although I am a little
>>> surprised to hear that this is happening. However, what worries me a bit
>>> more is the idea that a staff member can be accessing those logs without
>>> that access being recorded. This might be something that you wish to
>>> investigate further.
>>>
>>> I am not interested in getting this staff person into trouble. The
>>> information that they are providing is useful to the Signpost and certainly
>>> seems to be sanitized to a reasonable degree. However, it does concern me
>>> that they can access these logs without someone knowing about it, it seems
>>> to me that this sort of activity should be proactively disclosed to people
>>> in WMF who conduct legal and security reviews, and I hope you will consider
>>> what sort of security features are appropriate to make sure that occasions
>>> when anyone accesses the raw logs are recorded in a robust manner. I worry
>>> that if this one staffer can access logs without the higher-ups knowing
>>> about it, it is possible that someone who intends to do unethical
>>> activities with WMF's data could also access the logs without being noticed.
>>>
>>> Thanks,
>>>
>>> Pine
>>>
>>>
>>> On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin <tnegrin(a)wikimedia.org>
>>> wrote:
>>>
>>>> Hi Pine --
>>>>
>>>> Thanks for this -- it's a challenging topic but one that the
>>>> Analytics team takes very seriously.
>>>>
>>>> I'm not familiar with the IP address review that's referenced in
>>>> the link. I don't know who the staffer might be. We don't currently
>>>> calculate unique visitors to anything in Analytics and IP address is not a
>>>> particularly accurate way to assess unique visitors regardless (due to
>>>> proxies/NATs/etc).
>>>>
>>>> We do store IPs as part of page requests in our raw logs which are
>>>> deleted every 30 days. This data is kept on a system where access is
>>>> limited and controlled by the operations team. We're in line with the
>>>> privacy policy on this.
>>>>
>>>> To be clear, we are currently considering mechanisms to count
>>>> unique "requests" -- we rely on Comscore for this data and for several
>>>> reasons, primarily related to mobile usage, it's not sufficient to
>>>> understand our usage patterns. We are putting together some proposals to do
>>>> this in as limited a way as possible and that's respectful to our users.
>>>> We'll share this with the community when we feel we understand the use
>>>> cases and trade-offs well enough to discuss in an informed manner.
>>>>
>>>> -Toby
>>>>
>>>>
>>>>
>>>> We do store the IP address associated with varnish requests as part
>>>> of the log. This data is
>>>>
>>>>
>>>>
>>>> On Thu, Oct 16, 2014 at 8:50 PM, Pine W <wiki.pine(a)gmail.com>
>>>> wrote:
>>>>
>>>>> Hi again Analytics,
>>>>>
>>>>> I was under the impression that no records are kept of which IPs
>>>>> access which articles on Wikipedia when no edits are made, but it appears
>>>>> that such records are in fact kept [1].
>>>>>
>>>>> Is this proper? This practice appears to be permissible under the
>>>>> Privacy Policy which states that "We use IP addresses for research and
>>>>> analytics; to better personalize content, notices, and settings for you; to
>>>>> fight spam, identity theft, malware, and other kinds of abuse; and to
>>>>> provide better mobile and other applications."
>>>>>
>>>>> It is possible that this information is relevant for determining
>>>>> the number of unique visitors that Wikipedia gets and that this information
>>>>> is always properly filtered before it gets to the Signpost. However, given
>>>>> recent discussions which I thought said that Wikipedia was not instrumented
>>>>> to track unique visitors, I am surprised to learn that this already seems
>>>>> to be happening and that the situation has been this way for some time, so
>>>>> I would appreciate clarification.
>>>>>
>>>>> I want to emphasize that this question is about clarifying the
>>>>> practice of tracking likely unique visitors by IP. This question is not
>>>>> intended to start flame wars, get people into trouble, or limit the
>>>>> Signpost's access to properly filtered information if there has been a
>>>>> determination that WMF's retention of the raw data is appropriate. There
>>>>> might be appropriate secondary questions about making sure that access to
>>>>> the raw IP access data is carefully contained and secured.
>>>>>
>>>>> Thank you very much,
>>>>>
>>>>> Pine
>>>>>
>>>>> [1]
>>>>> https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&di…
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> Analytics(a)lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>
>>
>> --
>> Jonathan T. Morgan
>> Learning Strategist
>> Wikimedia Foundation
>> User:Jmorgan (WMF)
>> <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>> jmorgan(a)wikimedia.org
>>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
--
Oliver Keyes
Research Analyst
Wikimedia Foundation