Thanks very much, Toby and everyone.
Ironholds, I appreciate your doing traffic research on a volunteer basis
for the benefit of the Signpost and the community. I'm concerned that the
system as a whole may need a closer look, and I'm glad that Toby will be
doing this with input from Legal.
Toby: I hope we can continue to get some Ironholds-sponsored filtering
for the Traffic Report, although we may need to get it with some additional
conditions attached.
Thanks and regards,
Pine
On Fri, Oct 17, 2014 at 3:20 PM, Toby Negrin <tnegrin(a)wikimedia.org>
wrote:
Folks --
While I'm pleased that this validation was being done by a team member
with full knowledge of our privacy and data retention policies, I think
some good points have been raised that we're going to need to discuss as a
team. I've reached out to legal for their assistance in figuring out the
path forward.
-Toby
On Fri, Oct 17, 2014 at 3:16 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
> I see - Oliver's batman. Nothing to see here, moving on.
>
> On Fri, Oct 17, 2014 at 4:58 PM, Oliver Keyes <okeyes(a)wikimedia.org>
> wrote:
>
>> I should also point out that "Toby not knowing who the staffer doing
>> this one, highly specific, very minor piece of data-dogging is" does not
>> equate to analytics not knowing who it is. I don't know what you do for a
>> living but do you tend to give your boss's boss a constant play-by-play,
>> or? ;p. It's documented in Trello just like everything else.
>>
>> On 17 October 2014 16:55, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
>>
>>> It's me. Hi! I'm sort of confused by this.
>>>
>>> In terms of shady back-alley data dealing, let me set out exactly
>>> what happens.
>>>
>>> Every week, the signpost emails me a list of articles that have
>>> unexpectedly high pageview counts and would be in the top 25, but nobody
>>> can quite work out why they're so popular. I go through the logs for the
>>> last week (I'd be unable to do this for any queries more than a month ago
>>> anyway, since we only keep the unsampled data for that long, but a week is
>>> what's relevant here), and pull out a tuple of {ip, referer, user agent,
>>> article, requests} for the articles on that list.
>>>
>>> These tuples, which exist exclusively on our analytics machines (not
>>> even my personal, encrypted work laptop: they're only stored server-side,
>>> at all steps in this) are then hand-parsed by me. Can we pin all of the
>>> requests for [article], or at least most of them, on a single IP address,
>>> or a single {IP,user_agent} pair? Then it's probably a spammer or a spider
>>> or an [expletive]. No? Okay, if we sum by referer, do we see a common
>>> referer? If so, is that an actual referer or a fly-by-night live mirror?
>>> Questions like that.
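
[Editorial sketch, not part of the original message.] The hand-parsing described above could be approximated roughly as follows. This is only a minimal illustration under stated assumptions: the `LogRow` type, the `triage` function, the verdict strings, and the 50% threshold are all hypothetical and are not taken from the thread or from any Wikimedia tooling.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class LogRow(NamedTuple):
    """Hypothetical shape of the {ip, referer, user agent, article, requests} tuple."""
    ip: str
    referer: str
    user_agent: str
    article: str
    requests: int


def top_share(counter):
    """Return (key, share) for the key that accounts for the most requests."""
    total = sum(counter.values())
    if total == 0:
        return None, 0.0
    key, count = counter.most_common(1)[0]
    return key, count / total


def triage(rows, threshold=0.5):
    """Give each article a rough verdict; threshold is an assumed cut-off."""
    by_article = defaultdict(list)
    for row in rows:
        by_article[row.article].append(row)

    verdicts = {}
    for article, group in by_article.items():
        by_ip = Counter()
        by_referer = Counter()
        for r in group:
            by_ip[r.ip] += r.requests
            by_referer[r.referer] += r.requests

        # Step 1: can most requests be pinned on a single IP address?
        # (A per-{IP, user_agent} check would be a stricter variant of this.)
        _, ip_share = top_share(by_ip)
        # Step 2: otherwise, summing by referer, is there one common referer?
        referer, referer_share = top_share(by_referer)

        if ip_share >= threshold:
            verdicts[article] = "single-source"
        elif referer and referer_share >= threshold:
            # An empty referer (direct traffic) is deliberately not flagged.
            verdicts[article] = "common-referer"
        else:
            verdicts[article] = "looks-legit"
    return verdicts
```

Only the per-article verdict is returned, in keeping with the point made in the message that no IPs or user agents are passed on. Whether a flagged referer is a real site or a live mirror would still need a human to check.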
>>>
>>> When I'm done with all of the articles, I email the signpost with
>>> "for article1, that looks legit. Article2 is a web crawler I'm going to
>>> email and shout at. Article3 is a live mirror. Article4 looks legit.
>>> Article5...". These requests are logged on our trello board, just like any
>>> other data request from any other party, community or staff. Milowent and
>>> the other signposters get zero IPs, zero user agents, and nothing anywhere
>>> near that range of information: that stuff doesn't even leave the server.
>>> And when I'm done with it, I nuke it so it's not even *there*.
>>>
>>> I hope that clarifies what's happening here. If you have specific
>>> questions about what we keep, that's obviously more of a question for
>>> management.
>>>
>>> On 17 October 2014 12:27, Jonathan Morgan <jmorgan(a)wikimedia.org>
>>> wrote:
>>>
>>>> Pine, have you considered asking Milowent who they work with on the
>>>> IP data? I really, really doubt that there is some sort of shady back-alley
>>>> data dealing going down here. - Jonathan
>>>>
>>>> On Thu, Oct 16, 2014 at 9:52 PM, Pine W <wiki.pine(a)gmail.com> wrote:
>>>>
>>>>> Thanks Toby.
>>>>>
>>>>> I understand that IPs are not an especially accurate way to look at
>>>>> unique visitors, but for the purposes of the Signpost's traffic report and
>>>>> the Top 25 I feel that they are reasonable approximations of ways to filter
>>>>> out what appear to be automated requests.
>>>>>
>>>>> I am ok with holding those logs for 30 days, although I am a little
>>>>> surprised to hear that this is happening. However, what worries me a bit
>>>>> more is the idea that a staff member can be accessing those logs without
>>>>> that access being recorded. This might be something that you wish to
>>>>> investigate further.
>>>>>
>>>>> I am not interested in getting this staff person into trouble. The
>>>>> information that they are providing is useful to the Signpost and certainly
>>>>> seems to be sanitized to a reasonable degree. However, it does concern me
>>>>> that they can access these logs without someone knowing about it. It seems
>>>>> to me that this sort of activity should be proactively disclosed to people
>>>>> in WMF who conduct legal and security reviews, and I hope you will consider
>>>>> what sort of security features are appropriate to make sure that occasions
>>>>> when anyone accesses the raw logs are recorded in a robust manner. I worry
>>>>> that if this one staffer can access logs without the higher-ups knowing
>>>>> about it, it is possible that someone who intends to do unethical
>>>>> activities with WMF's data could also access the logs without being noticed.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Pine
>>>>>
>>>>>
>>>>> On Thu, Oct 16, 2014 at 9:31 PM, Toby Negrin <tnegrin(a)wikimedia.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Pine --
>>>>>>
>>>>>> Thanks for this -- it's a challenging topic but one that the
>>>>>> Analytics team takes very seriously.
>>>>>>
>>>>>> I'm not familiar with the IP address review that's referenced in
>>>>>> the link. I don't know who the staffer might be. We don't currently
>>>>>> calculate unique visitors to anything in Analytics and IP address is not a
>>>>>> particularly accurate way to assess unique visitors regardless (due to
>>>>>> proxies/NATs/etc).
>>>>>>
>>>>>> We do store IPs as part of page requests in our raw logs which are
>>>>>> deleted every 30 days. This data is kept on a system where access is
>>>>>> limited and controlled by the operations team. We're in line with the
>>>>>> privacy policy on this.
>>>>>>
>>>>>> To be clear, we are currently considering mechanisms to count
>>>>>> unique "requests" -- we rely on Comscore for this data and for several
>>>>>> reasons, primarily related to mobile usage, it's not sufficient to
>>>>>> understand our usage patterns. We are putting together some proposals to do
>>>>>> this in a way that is as limited as possible and respectful to our users.
>>>>>> We'll share this with the community when we feel we understand the use
>>>>>> cases and trade-offs well enough to discuss in an informed manner.
>>>>>>
>>>>>> -Toby
>>>>>>
>>>>>>
>>>>>>
>>>>>> We do store the IP address associated with varnish requests as
>>>>>> part of the log. This data is
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 16, 2014 at 8:50 PM, Pine W <wiki.pine(a)gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi again Analytics,
>>>>>>>
>>>>>>> I was under the impression that no records are kept of which IPs
>>>>>>> access which articles on Wikipedia when no edits are made, but it appears
>>>>>>> that such records are in fact kept [1].
>>>>>>>
>>>>>>> Is this proper? This practice appears to be permissible under the
>>>>>>> Privacy Policy which states that "We use IP addresses for research and
>>>>>>> analytics; to better personalize content, notices, and settings for you; to
>>>>>>> fight spam, identity theft, malware, and other kinds of abuse; and to
>>>>>>> provide better mobile and other applications."
>>>>>>>
>>>>>>> It is possible that this information is relevant for determining
>>>>>>> the number of unique visitors that Wikipedia gets and that this information
>>>>>>> is always properly filtered before it gets to the Signpost. However, given
>>>>>>> recent discussions which I thought said that Wikipedia was not instrumented
>>>>>>> to track unique visitors, I am surprised to learn that this already seems
>>>>>>> to be happening and that the situation has been this way for some time, so
>>>>>>> I would appreciate clarification.
>>>>>>>
>>>>>>> I want to emphasize that this question is about clarifying the
>>>>>>> practice of tracking likely unique visitors by IP. This question is not
>>>>>>> intended to start flame wars, get people into trouble, or limit the
>>>>>>> Signpost's access to properly filtered information if there has been a
>>>>>>> determination that WMF's retention of the raw data is appropriate. There
>>>>>>> might be appropriate secondary questions about making sure that access to
>>>>>>> the raw IP access data is carefully contained and secured.
>>>>>>>
>>>>>>> Thank you very much,
>>>>>>>
>>>>>>> Pine
>>>>>>>
>>>>>>> [1]
>>>>>>> https://en.wikipedia.org/w/index.php?title=User_talk%3ASerendipodous&di…
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Analytics mailing list
>>>>>>> Analytics(a)lists.wikimedia.org
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>> --
>>>> Jonathan T. Morgan
>>>> Learning Strategist
>>>> Wikimedia Foundation
>>>> User:Jmorgan (WMF)
>>>> <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
>>>> jmorgan(a)wikimedia.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>
>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>
>