As an update on this thread, the Analytics team is now closing in on making
this happen and providing parsed data like OS and browser directly in the
EventLogging capsule. \o/
It is also planned to remove the raw user agent entirely from the data, so
anyone who has been depending on it for their analysis should make sure
that they can work with the parsed data, and rewrite their queries (cf.
discussion at
). If I'm reading correctly, this will also involve
renaming the existing EL tables.
On Wed, Dec 14, 2016 at 8:25 AM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
Thanks to all who responded in this thread!
I have now filed a Phabricator task for augmenting the EventLogging
capsule with this kind of pre-parsed data alongside the existing raw user
agent field:
https://phabricator.wikimedia.org/T153207
(Also, as Nuria recalled on Phabricator at the time, this would in
addition help to address the open issue of filtering out spiders in EL:
https://phabricator.wikimedia.org/T121550 )
Back in September I ended up continuing to use a rather simplistic regex
in MySQL for the task at hand (restricting to Firefox UAs to mitigate
T146840), but the experience only confirmed that it would be much better to
have the browser family etc. detected by the ua-parser library.
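For illustration, a "simplistic regex" restriction to Firefox UAs of the kind described above might look like the Python sketch below. The pattern is an assumption for illustration only, not the query actually used in MySQL, and it shows why such regexes are fragile compared to ua-parser:

```python
import re

# Naive pattern: match any UA that carries a Firefox/<version> token.
# This is an illustrative assumption, not the actual query used; it
# misses Firefox forks and matches non-Firefox UAs that spoof the token.
FIREFOX_RE = re.compile(r'\bFirefox/(\d+)')

def is_firefox(user_agent):
    """Return the major version if the UA looks like Firefox, else None."""
    m = FIREFOX_RE.search(user_agent)
    return int(m.group(1)) if m else None

print(is_firefox(
    'Mozilla/5.0 (X11; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0'
))  # 49
print(is_firefox('Mozilla/5.0 (compatible; Googlebot/2.1)'))  # None
```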
On Thu, Sep 15, 2016 at 10:37 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
I think we can also probably consider doing the parsing in EL/MySQL so
the user agent is never raw on tables but rather always parsed. We could
use the python ua parser library and results should be identical to the
ones we have on Hive.
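Nuria's suggestion amounts to storing a parsed map instead of the raw string. A real implementation would presumably call the Python ua-parser library (`ua_parser.user_agent_parser.Parse`) so that results match the Hive UDF; the stdlib-only stand-in below is only a sketch of what the stored shape could look like, with a hypothetical function name and regex table:

```python
import json
import re

def parse_user_agent(ua):
    """Toy stand-in for ua-parser: reduce a raw UA string to a small map.

    The real EL change would use the ua-parser library so results match
    the Hive UDF; this regex table is an illustration only. Order matters:
    Chrome UAs also contain a Safari/ token, so Chrome is checked first.
    """
    browsers = [
        ('Firefox', r'Firefox/(\d+)'),
        ('Chrome', r'Chrome/(\d+)'),
        ('Safari', r'Version/(\d+).*Safari/'),
    ]
    for family, pattern in browsers:
        m = re.search(pattern, ua)
        if m:
            return {'browser_family': family, 'browser_major': m.group(1)}
    return {'browser_family': 'Other', 'browser_major': None}

# Store the parsed map (e.g. serialized as JSON) instead of the raw UA:
ua = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
      'Chrome/53.0.2785.89 Safari/537.36')
print(json.dumps(parse_user_agent(ua)))
```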
Thanks,
Nuria
On Thu, Sep 15, 2016 at 1:06 PM, Andrew Otto <otto(a)wikimedia.org> wrote:
> I’ve added an example to
> https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive on how to
> use the UAParserUDF and
> the Hive get_json_object function to work with a user_agent_map.
>
> Unfortunately we can’t manage tables in Hive for every EventLogging
> schema/revision like we do in MySQL. So, you have to create your own
> table. It *should* be possible to specify the schema and use
> the org.apache.hive.hcatalog.data.JsonSerDe, but I haven’t tried this.
>
> Hope that helps!
>
> On Thu, Sep 15, 2016 at 3:19 PM, Marcel Ruiz Forns <mforns(a)wikimedia.org>
> wrote:
>
>> Just a heads up:
>>
>> user_agent field is a PII field (privacy sensitive), and as such it is
>> purged after 90 days. If a user_agent_map field is added, it would need
>> to be purged after 90 days as well.
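The purge rule Marcel describes could be sketched as follows; the function name, field names, and row shape here are hypothetical illustrations, not EL's actual purge job:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)

def purge_pii(rows, now, pii_fields=('user_agent', 'user_agent_map')):
    """Null out privacy-sensitive fields on rows older than the window.

    Hypothetical sketch of the 90-day purge rule: field names and row
    shape are assumptions, not how EL's purge job is actually written.
    """
    for row in rows:
        if now - row['timestamp'] > RETENTION:
            for field in pii_fields:
                if field in row:
                    row[field] = None
    return rows

rows = [
    {'timestamp': datetime(2016, 1, 1), 'user_agent': 'Mozilla/5.0 ...'},
    {'timestamp': datetime(2016, 9, 1), 'user_agent': 'Mozilla/5.0 ...'},
]
purged = purge_pii(rows, now=datetime(2016, 9, 15))
# The January row's user_agent is nulled; the September row keeps it.
```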
>>
>> Another more permanent option might be to detect the browser family on
>> the JavaScript client with e.g. duck-typing[1] and send it as part of the
>> explicit schema. The browser family by itself is not identifying enough to
>> be considered PII, and could be kept indefinitely.
>>
>> [1] http://stackoverflow.com/questions/9847580/how-to-detect-safari-chrome-ie-firefox-and-opera-browser
>>
>> On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell <jane023(a)gmail.com>
>> wrote:
>>
>>> It's not just a question of which value to choose, but also how to
>>> sort. It would be nice to be able to choose sorting in alphabetical order
>>> vs numerical order. It would also be nice to assign a default sort to any
>>> item label that is taken from the Wikipedia {{DEFAULTSORT}} template
>>> (though that won't work for items without a Wikipedia article).
>>>
>>> On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu <
>>> dandreescu(a)wikimedia.org> wrote:
>>>
>>>> The problem with working on EL data in hive is that the schemas for
>>>> the tables can change at any point, in backwards-incompatible ways. And
>>>> maintaining tables dynamically is harder here than in mysql world (where
>>>> EL just tries to insert, and creates the table on failure). So, while
>>>> it's relatively easy to use ua-parser (see below), you can't easily
>>>> access EL data in hive tables. However, we do have all EL data in
>>>> hadoop, so you can access it with Spark. Andrew's about to answer with
>>>> more details on that.
>>>> I just thought this might be useful if you sqoop EL data from mysql or
>>>> otherwise import it into a Hive table:
>>>>
>>>>
>>>> from stat1002, start hive, then:
>>>>
>>>> ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.35.jar;
>>>>
>>>> CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
>>>>
>>>> select ua_parser('Wikimedia Bot');
>>>>
>>>>
The question was actually about doing UA parsing in MySQL directly, but I
did appreciate the additional information about Hadoop-based options (even
though they don't cover many use cases that will be addressed by storing
parsed data directly in EL tables).
To go off on a tangent for a bit: Accessing EL data in Hadoop is an
interesting topic, but as far as I know it has not been done widely before.
I have been interested in it for a while (mainly because of the performance
issues with some large EL tables in MariaDB) and in January Marcel walked
me through the steps at
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hadoop and we
successfully imported one
partition (one hour's worth of data). But we got stuck at the question of
how to merge separately imported partitions, and I still don't see that
addressed in the documentation. I understand that's a separate problem from
the schema versioning issues discussed in this thread. Also, while the
Spark option sounds cool, it would involve learning an entirely new tool
and workflow for myself and other analysts (IIRC Oliver also made that
point when we discussed this on IRC earlier this year with the team).
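The unresolved merge step might, under some assumptions, be approached like this: treat each imported hourly partition as a list of JSON event lines and deduplicate on the capsule's uuid field. The helper name and data shape below are hypothetical, not the documented workflow:

```python
import json

def merge_partitions(partitions):
    """Merge hourly EL partitions into one event list, dropping duplicates.

    Hypothetical sketch of the missing merge step: each partition is a
    list of JSON event strings, and events are assumed to carry a unique
    'uuid' field (which the EL capsule does provide).
    """
    seen = set()
    merged = []
    for partition in partitions:
        for line in partition:
            event = json.loads(line)
            if event['uuid'] not in seen:
                seen.add(event['uuid'])
                merged.append(event)
    return merged

# An event duplicated across two hourly imports is kept only once:
hour1 = ['{"uuid": "a", "event": {}}', '{"uuid": "b", "event": {}}']
hour2 = ['{"uuid": "b", "event": {}}', '{"uuid": "c", "event": {}}']
print(len(merge_partitions([hour1, hour2])))  # 3
```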
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB