[Analytics] Removing raw user agent strings from EventLogging (was: Parsing user agents in EventLogging data)

20 Mar 2017


      As an update on this thread, the Analytics team is now closing in on making
this happen and providing parsed data like OS and browser directly in the
EventLogging capsule. \o/
It is also planned to remove the raw user agent entirely from the data. So
anyone who has been depending on this for their analysis should make sure
that they can work with the parsed data, and rewrite their queries (cf.
discussion at https://phabricator.wikimedia.org/T153207 ; the Performance
team has already gone through this work:
https://phabricator.wikimedia.org/T156760 ). If I'm reading
https://phabricator.wikimedia.org/T160454 correctly, it will also involve
renaming of existing EL tables.
On Wed, Dec 14, 2016 at 8:25 AM, Tilman Bayer tbayer@wikimedia.org wrote:
...
Thanks to all who responded in this thread!
I have now filed a Phabricator task for augmenting the EventLogging
capsule with this kind of pre-parsed data alongside the existing raw user
agent field: https://phabricator.wikimedia.org/T153207
(Also, as Nuria recalled on Phabricator at the time, this would in
addition help to address the open issue of filtering out spiders in EL:
https://phabricator.wikimedia.org/T121550 )
Back in September I ended up continuing to use an rather simplistic regex
in MySQL for the task at hand (restricting to Firefox UAs to mitigate
T146840), but the experience only confirmed that it would be much better to
have browser family etc. detected by the ua-parser library.
On Thu, Sep 15, 2016 at 10:37 PM, Nuria Ruiz nuria@wikimedia.org wrote:
...
I think we can also probably consider doing the parsing in EL/MySQL so
the user agent is never raw on tables but rather always parsed. We could
use the python ua parser library and results should be identical to the
ones we have on Hive.
Thanks,
Nuria
On Thu, Sep 15, 2016 at 1:06 PM, Andrew Otto otto@wikimedia.org wrote:
...
I’ve added an example to https://wikitech.wikimedia.
org/wiki/Analytics/EventLogging#Hive on how to use the UAParserUDF and
the Hive get_json_object function to work with a user_agent_map.
Unfortunately we can’t manage tables in Hive for every EventLogging
schema/revision like we do in MySQL.  So, you have to create your own
table. It *should* be possible to specify the schema and use
the org.apache.hive.hcatalog.data.JsonSerDe, but I haven’t tried this.
Hope that helps!
On Thu, Sep 15, 2016 at 3:19 PM, Marcel Ruiz Forns <mforns@wikimedia.org
...
wrote:
...
Just a heads up:
user_agent field is a PII field (privacy sensitive), and as such it is
purged after 90 days. If there would be a user_agent_map field, it should
be purged after 90 days as well.
Another more permanent option might be to detect the browser family on
the JavaScript client with i.e. duck-typing[1] and send it as part of the
explicit schema. The browser family by itself is not identifying enough to
be considered PII, and could be kept indefinitely.
[1] http://stackoverflow.com/questions/9847580/how-to-detect
-safari-chrome-ie-firefox-and-opera-browser
On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell jane023@gmail.com
wrote:
...
It's not just a question of which value to choose, but also how to
sort. It would be nice to be able to choose sorting in alphabetical order
vs numerical order. It would also be nice to assign a default sort to any
item label that is taken from the Wikipedia {{DEFAULTSORT}} template
(though that won't work for items without a Wikipedia article).
On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu <
dandreescu@wikimedia.org> wrote:
...
The problem with working on EL data in hive is that the schemas for
the tables can change at any point, in backwards-incompatible ways.  And
maintaining tables dynamically is harder here than in mysql world (where EL
just tries to insert, and creates the table on failure).  So, while it's
relatively easy to use ua-parser (see below), you can't easily access EL
data in hive tables.  However, we do have all EL data in hadoop, so you can
access it with Spark.  Andrew's about to answer with more details on that.
I just thought this might be useful if you sqoop EL data from mysql or
otherwise import it into a Hive table:
from stat1002, start hive, then:
ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/a
nalytics/refinery/refinery-hive-0.0.35.jar;
CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refin
ery.hive.UAParserUDF';
select ua_parser('Wikimedia Bot');
The question was actually about doing UA parsing in MySQL directly, but I
did appreciate the additional information about Hadoop-based options (even
though they don't cover many use cases that will be addressed by storing
parsed data directly in EL tables).
To go off on a tangent for a bit: Accessing EL data in Hadoop is an
interesting topic, but as far as I know it has not been done widely before.
I have been interested in it for a while (maInly because of the performance
issues with some large EL tables in MariaDB) and in January Marcel walked
me through the steps at https://wikitech.wikimedia.org
/wiki/Analytics/EventLogging#Hadoop and we successfully imported one
partition (one hour's worth of data). But we got stuck at the question of
how to merge separately imported partitions, and I still don't see that
addressed in the documentation. I understand that's a separate problem from
the schema versioning issues discussed in this thread. Also, while the
Spark option sounds cool, it would involve learning an entirely new tool
and workflow for myself and other analysts (IIRC Oliver also made that
point when we discussed this on IRC earlier this year with the team).
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

[Analytics] Removing raw user agent strings from EventLogging (was: Parsing user agents in EventLogging data)