So it's distinct people, globally - and I deliberately made it wooly by
operating over username, which means the threshold is fuzzy (i.e., at a
minimum it's 50; at a maximum it's 50 x [number of wikis]).
It's very deliberately dimension-free: user_agent,
edit_count_in_non_specified_90_day_period, and that's it.
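
For anyone who wants the rule spelled out, here is a minimal sketch of the
threshold as described in this thread: keep a user agent only if at least 50
distinct usernames used it, and emit nothing but the agent and its edit count.
The function name, the (username, user_agent, edit_count) input shape and the
sample values are all hypothetical; this is not the actual query run against
the checkuser table.

```python
from collections import defaultdict

# Threshold stated in the thread: agents used by fewer than 50 distinct
# usernames are excluded.
MIN_DISTINCT_USERNAMES = 50


def filter_user_agents(rows, min_users=MIN_DISTINCT_USERNAMES):
    """Return {user_agent: total_edit_count} for sufficiently common agents.

    `rows` is an iterable of (username, user_agent, edit_count) tuples.
    This input shape is an assumption made for illustration only.
    """
    usernames = defaultdict(set)  # user_agent -> set of distinct usernames
    edits = defaultdict(int)      # user_agent -> summed edit count
    for username, user_agent, edit_count in rows:
        usernames[user_agent].add(username)
        edits[user_agent] += edit_count
    # Only the agent and its edit count survive; usernames are dropped.
    return {
        ua: edits[ua]
        for ua, names in usernames.items()
        if len(names) >= min_users
    }
```

Because the count is over usernames rather than people, a single username that
appears many times still only counts once towards the threshold.
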
On 4 March 2015 at 17:12, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
Assuming this was public, I could use this data
on seldom edited Wikis to
find out which editors likely have old browser/OS versions with
vulnerabilities that I could attack[1]. This would be easier and easier the
more dimensions you add to the data.
<re-reads>
OK. The anonymization strategy for dropping records that represent < 50
distinct editors seems to address this concern. 50 edits is a lot. So
this data wouldn't be too terribly useful for under-active wikis. Then
again, if you just want to get a sense for what the dominant browser/OS pairs
are, then they will likely represent > 50 unique editors on most projects.
1. Props to Matt Flaschen and Dan Andreescu for helping me work through the
implications of that one.
On Tue, Mar 3, 2015 at 9:59 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
>
> Yeah, makes sense.
>
> On 3 March 2015 at 20:38, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>>> Agreed. Do we have a way of syncing files to Labs yet?
>> No need to sync if file is available in an endpoint like
>> http://some-data-here
>>
>> On Tue, Mar 3, 2015 at 4:50 PM, Oliver Keyes <okeyes(a)wikimedia.org>
>> wrote:
>>>
>>> On 3 March 2015 at 19:35, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>>>>> Erik has asked me to write an exploratory app for user-agent data. The
>>>>> idea is to enable Product Managers and engineers to easily explore
>>>>> what users use so they know what to support. I've thrown up an example
>>>>> screenshot at http://ironholds.org/agents_example_screen.png
>>>>
>>>> I cannot speak as to the interest of the community in this data, but for
>>>> developers and PMs we should make sure we have a solid way to update any
>>>> data we put up. User agent data is outdated as soon as a new version of
>>>> Android or iOS is released, a new popular phone comes along, or a popular
>>>> browser ships a new autoupdate. Not only that: if we make changes to, say,
>>>> redirect all iPad users to the desktop site, we want to assess the effect
>>>> of those changes as soon as possible. A monthly update will be a must.
>>>> Also, distinguishing between browser percentages on the desktop site
>>>> versus the mobile site versus apps is a must for this data to be really
>>>> useful for PMs and developers (especially for bug triage).
>>>>
>>>
>>> Yes! However, I am addressing a specific ad-hoc request. If there is a
>>> need for this (I agree there is) I hope Toby and Kevin can eke out the
>>> time on the Analytics Engineering schedule to work on it; y'all are a
>>> lot better at infrastructure work than me :).
>>>
>>>>
>>>> We have a couple of backlog items to make monthly reports in this regard.
>>>> A UI on top of them would be superb.
>>>>
>>>
>>> Agreed. Do we have a way of syncing files to Labs yet? That's the
>>> biggest blocker. The UI doesn't care what the file contains as long as
>>> it's a TSV with a header row - I've deliberately built it so that
>>> things like the download links are dynamic and can change.
>>>
>>>>
>>>> On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes <okeyes(a)wikimedia.org>
>>>> wrote:
>>>>>
>>>>> Hey all,
>>>>>
>>>>> (Sending this to the public list because it's more transparent and I'd
>>>>> like people who think this data is useful to be able to shout out)
>>>>>
>>>>> Erik has asked me to write an exploratory app for user-agent data. The
>>>>> idea is to enable Product Managers and engineers to easily explore
>>>>> what users use so they know what to support. I've thrown up an example
>>>>> screenshot at http://ironholds.org/agents_example_screen.png (I'd
>>>>> host it on Commons, inb4Dario, but I'm not sure of the copyright status
>>>>> of the UI)
>>>>>
>>>>> One side-effect of this is that we end up with files of common user
>>>>> agents, split between {readers,editors} and {mobile, desktop}, parsed
>>>>> and unparsed. I'd like to release these files. The reuse potential is
>>>>> twofold: researchers and engineers can use the parsed files to see
>>>>> what browser penetration looks like globally and what browsers should
>>>>> be supported at a top-10, and software engineers can use the unparsed
>>>>> files to improve detection rates.
>>>>>
>>>>> The privacy implications /should/ be minimal, because of how this data
>>>>> is gathered. The editor data is gathered from the checkuser table,
>>>>> globally, and automatically excludes any user agent used by fewer than
>>>>> 50 distinct usernames. The reader data is gathered from a month of
>>>>> 1:1000 sampled log files, and excludes any agent responsible for fewer
>>>>> than 500 pageviews in a 24 hour period (except, sampled. So,
>>>>> practically speaking, that's 500,000 pageviews)
>>>>>
>>>>> What do people think about making this a data release? Would people
>>>>> get value from the data, as well as the tool?
>>>>>
>>>>> --
>>>>> Oliver Keyes
>>>>> Research Analyst
>>>>> Wikimedia Foundation
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> Analytics(a)lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics