Re: [Analytics] [Discussion] User agent data releases

4 Mar 2015


Assuming this was public, I could use this data on seldom-edited wikis to
find out which editors likely have old browser/OS versions with
vulnerabilities that I could attack[1].  This would get easier and easier
the more dimensions you add to the data.
<re-reads>
OK.  The anonymization strategy of dropping records that represent < 50
distinct editors seems to address this concern.  50 editors is a lot, so
this data wouldn't be terribly useful for less active wikis.  Then
again, if you just want to get a sense of what the dominant browser/OS
pairs are, they will likely represent at least 50 unique editors on most
projects.
1. Props to Matt Flaschen and Dan Andreescu for helping me work through the
implications of that one.
On Tue, Mar 3, 2015 at 9:59 PM, Oliver Keyes okeyes@wikimedia.org wrote:
...
Yeah, makes sense.
On 3 March 2015 at 20:38, Nuria Ruiz nuria@wikimedia.org wrote:
...
...
Agreed. Do we have a way of syncing files to Labs yet?
No need to sync if the file is available at an endpoint like
http://some-data-here
On Tue, Mar 3, 2015 at 4:50 PM, Oliver Keyes okeyes@wikimedia.org
wrote:
...
...
On 3 March 2015 at 19:35, Nuria Ruiz nuria@wikimedia.org wrote:
...
...
Erik has asked me to write an exploratory app for user-agent data. The
idea is to enable Product Managers and engineers to easily explore
what users use so they know what to support. I've thrown up an example
screenshot at http://ironholds.org/agents_example_screen.png
I cannot speak to the community's interest in this data, but for
developers and PMs we should make sure we have a solid way to update any
data we put up. User agent data is outdated as soon as a new version of
Android or iOS is released, a new popular phone comes along, or a new
autoupdate for popular browsers ships. Not only that: if we make changes
to, say, redirect all iPad users to the desktop site, we want to assess
the effect of those changes as soon as possible. A monthly update will be
a must. Distinguishing between browser percentages on the desktop site
versus the mobile site versus the apps is also a must for this data to be
really useful for PMs and developers (especially for bug triage).
Yes! However, I am addressing a specific ad-hoc request. If there is a
need for this (I agree there is) I hope Toby and Kevin can eke out the
time on the Analytics Engineering schedule to work on it; y'all are a
lot better at infrastructure work than me :).
...
We have a couple of backlog items to make monthly reports in this regard.
A UI on top of them would be superb.
Agreed. Do we have a way of syncing files to Labs yet? That's the
biggest blocker. The UI doesn't care what the file contains as long as
it's a TSV with a header row - I've deliberately built it so that
things like the download links are dynamic and can change.
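The "TSV with a header row" contract above can be sketched in a few lines. This is only an illustration, not the app's actual code: the function name `read_tsv` and the sample column names are assumptions, since the thread does not say what columns the real files contain.

```python
import csv
import io


def read_tsv(handle):
    """Return (column_names, rows) from a tab-separated file with a header row.

    Nothing here depends on specific column names; the header row alone
    decides what each row dictionary contains.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return reader.fieldnames, rows


# Hypothetical sample data; the real files' columns are not given in the thread.
sample = io.StringIO("browser\tos\tpercentage\nFirefox\tWindows\t12.5\n")
columns, rows = read_tsv(sample)
```

Because the reader takes its field names from the header, a file with different or additional columns would load without code changes, which is the property the UI relies on.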
...
On Tue, Mar 3, 2015 at 1:05 PM, Oliver Keyes okeyes@wikimedia.org
wrote:
...
Hey all,
(Sending this to the public list because it's more transparent and I'd
like people who think this data is useful to be able to shout out)
Erik has asked me to write an exploratory app for user-agent data. The
idea is to enable Product Managers and engineers to easily explore
what users use so they know what to support. I've thrown up an example
screenshot at http://ironholds.org/agents_example_screen.png  (I'd
host it on Commons, inb4Dario, but I'm not sure of the copyright status
of the UI)
One side-effect of this is that we end up with files of common user
agents, split between {readers,editors} and {mobile, desktop}, parsed
and unparsed. I'd like to release these files. The reuse potential is
twofold: researchers and engineers can use the parsed files to see
what browser penetration looks like globally and what browsers should
be supported at a top-10, and software engineers can use the unparsed
files to improve detection rates.
The privacy implications /should/ be minimal, because of how this data
is gathered. The editor data is gathered from the checkuser table,
globally, and automatically excludes any user agent used by fewer than
50 distinct usernames. The reader data is gathered from a month of
1:1000 sampled log files, and excludes any agent responsible for fewer
than 500 pageviews in a 24 hour period (except, sampled. So,
practically speaking, that's 500,000 pageviews)
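The two thresholds described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual pipeline: the function names and input shapes ((username, user_agent) pairs for editors, a per-day mapping of user agent to sampled pageviews for readers) are hypothetical. Only the numbers 50, 500 and 1:1000 come from the message.

```python
from collections import defaultdict

MIN_DISTINCT_USERS = 50      # editor threshold from the message
MIN_SAMPLED_PAGEVIEWS = 500  # reader threshold, counted in the sampled logs
SAMPLING_RATE = 1000         # 1:1000 sampling, so 1 logged hit ~ 1000 real ones


def filter_editor_agents(rows, min_users=MIN_DISTINCT_USERS):
    """Keep user agents used by at least `min_users` distinct usernames.

    `rows` is an iterable of (username, user_agent) pairs (assumed shape).
    Returns {user_agent: distinct_username_count}.
    """
    users_by_agent = defaultdict(set)
    for username, user_agent in rows:
        users_by_agent[user_agent].add(username)
    return {
        agent: len(users)
        for agent, users in users_by_agent.items()
        if len(users) >= min_users
    }


def filter_reader_agents(sampled_counts, min_hits=MIN_SAMPLED_PAGEVIEWS):
    """Keep agents with at least `min_hits` sampled pageviews in one day.

    `sampled_counts` maps user_agent -> pageviews seen in the sampled logs
    for a single 24 hour period (assumed shape).
    Returns {user_agent: estimated_real_pageviews}.
    """
    return {
        agent: hits * SAMPLING_RATE
        for agent, hits in sampled_counts.items()
        if hits >= min_hits
    }
```

Counting distinct usernames with a set means repeat edits by one person never push an agent over the editor threshold, and multiplying by the sampling rate shows why 500 sampled pageviews correspond to roughly 500,000 real ones.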
What do people think about making this a data release? Would people
get value from the data, as well as the tool?
--
Oliver Keyes
Research Analyst
Wikimedia Foundation

Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

