As an update on this thread, the Analytics team is now closing in on making
this happen and providing parsed data like OS and browser directly in the
EventLogging capsule. \o/
It is also planned to remove the raw user agent entirely from the data, so
anyone who has been depending on it for their analysis should make sure
that they can work with the parsed data, and rewrite their queries (cf.
discussion at
). If I'm reading correctly, this will also involve
renaming the existing EL tables.
On Wed, Dec 14, 2016 at 8:25 AM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
Thanks to all who responded in this thread!
I have now filed a Phabricator task for augmenting the EventLogging
capsule with this kind of pre-parsed data alongside the existing raw user
agent field:
https://phabricator.wikimedia.org/T153207
(Also, as Nuria recalled on Phabricator at the time, this would in
addition help to address the open issue of filtering out spiders in EL:
https://phabricator.wikimedia.org/T121550 )
Back in September I ended up continuing to use a rather simplistic regex
in MySQL for the task at hand (restricting to Firefox UAs to mitigate
T146840), but the experience only confirmed that it would be much better to
have the browser family etc. detected by the ua-parser library.
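For illustration, a "simplistic regex" restriction to Firefox UAs of the kind described above might look like the Python sketch below. The pattern is an assumption for illustration only, not the query actually used in MySQL, and it shows why such regexes are fragile compared to ua-parser:

```python
import re

# Naive pattern: match any UA that carries a Firefox/<version> token.
# This is an illustrative assumption, not the actual query used; it
# misses Firefox forks and matches non-Firefox UAs that spoof the token.
FIREFOX_RE = re.compile(r'\bFirefox/(\d+)')

def is_firefox(user_agent):
    """Return the major version if the UA looks like Firefox, else None."""
    m = FIREFOX_RE.search(user_agent)
    return int(m.group(1)) if m else None

print(is_firefox(
    'Mozilla/5.0 (X11; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0'
))  # 49
print(is_firefox('Mozilla/5.0 (compatible; Googlebot/2.1)'))  # None
```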
On Thu, Sep 15, 2016 at 10:37 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
I think we can also probably consider doing the parsing in EL/MySQL so
the user agent is never raw on tables but rather always parsed. We could
use the python ua parser library and results should be identical to the
ones we have on Hive.
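Nuria's suggestion amounts to storing a parsed map instead of the raw string. A real implementation would presumably call the Python ua-parser library (`ua_parser.user_agent_parser.Parse`) so that results match the Hive UDF; the stdlib-only stand-in below is only a sketch of what the stored shape could look like, with a hypothetical function name and regex table:

```python
import json
import re

def parse_user_agent(ua):
    """Toy stand-in for ua-parser: reduce a raw UA string to a small map.

    The real EL change would use the ua-parser library so results match
    the Hive UDF; this regex table is an illustration only. Order matters:
    Chrome UAs also contain a Safari/ token, so Chrome is checked first.
    """
    browsers = [
        ('Firefox', r'Firefox/(\d+)'),
        ('Chrome', r'Chrome/(\d+)'),
        ('Safari', r'Version/(\d+).*Safari/'),
    ]
    for family, pattern in browsers:
        m = re.search(pattern, ua)
        if m:
            return {'browser_family': family, 'browser_major': m.group(1)}
    return {'browser_family': 'Other', 'browser_major': None}

# Store the parsed map (e.g. serialized as JSON) instead of the raw UA:
ua = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
      'Chrome/53.0.2785.89 Safari/537.36')
print(json.dumps(parse_user_agent(ua)))
```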
Thanks,
Nuria
On Thu, Sep 15, 2016 at 1:06 PM, Andrew Otto <otto(a)wikimedia.org> wrote:
> I’ve added an example to
> https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive on how to
> use the UAParserUDF and
> the Hive get_json_object function to work with a user_agent_map.
>
> Unfortunately we can’t manage tables in Hive for every EventLogging
> schema/revision like we do in MySQL. So, you have to create your own
> table. It *should* be possible to specify the schema and use
> the org.apache.hive.hcatalog.data.JsonSerDe, but I haven’t tried this.
>
> Hope that helps!
>
> On Thu, Sep 15, 2016 at 3:19 PM, Marcel Ruiz Forns <mforns(a)wikimedia.org>
> wrote:
>
>> Just a heads up:
>>
>> user_agent field is a PII field (privacy sensitive), and as such it is
>> purged after 90 days. If a user_agent_map field is added, it would need
>> to be purged after 90 days as well.
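The purge rule Marcel describes could be sketched as follows; the function name, field names, and row shape here are hypothetical illustrations, not EL's actual purge job:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)

def purge_pii(rows, now, pii_fields=('user_agent', 'user_agent_map')):
    """Null out privacy-sensitive fields on rows older than the window.

    Hypothetical sketch of the 90-day purge rule: field names and row
    shape are assumptions, not how EL's purge job is actually written.
    """
    for row in rows:
        if now - row['timestamp'] > RETENTION:
            for field in pii_fields:
                if field in row:
                    row[field] = None
    return rows

rows = [
    {'timestamp': datetime(2016, 1, 1), 'user_agent': 'Mozilla/5.0 ...'},
    {'timestamp': datetime(2016, 9, 1), 'user_agent': 'Mozilla/5.0 ...'},
]
purged = purge_pii(rows, now=datetime(2016, 9, 15))
# The January row's user_agent is nulled; the September row keeps it.
```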
>>
>> Another more permanent option might be to detect the browser family on
>> the JavaScript client with e.g. duck-typing[1] and send it as part of the
>> explicit schema. The browser family by itself is not identifying enough to
>> be considered PII, and could be kept indefinitely.
>>
>> [1] http://stackoverflow.com/questions/9847580/how-to-detect-safari-chrome-ie-firefox-and-opera-browser
>>
>> On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell <jane023(a)gmail.com>
>> wrote:
>>
>>> It's not just a question of which value to choose, but also how to
>>> sort. It would be nice to be able to choose sorting in alphabetical order
>>> vs numerical order. It would also be nice to assign a default sort to any
>>> item label that is taken from the Wikipedia {{DEFAULTSORT}} template
>>> (though that won't work for items without a Wikipedia article).
>>>
>>> On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu <
>>> dandreescu(a)wikimedia.org> wrote:
>>>
>>>> The problem with working on EL data in hive is that the schemas for
>>>> the tables can change at any point, in backwards-incompatible ways. And
>>>> maintaining tables dynamically is harder here than in mysql world (where
>>>> EL just tries to insert, and creates the table on failure). So, while
>>>> it's relatively easy to use ua-parser (see below), you can't easily
>>>> access EL data in hive tables. However, we do have all EL data in
>>>> hadoop, so you can access it with Spark. Andrew's about to answer with
>>>> more details on that.
>>>> I just thought this might be useful if you sqoop EL data from mysql or
>>>> otherwise import it into a Hive table:
>>>>
>>>>
>>>> from stat1002, start hive, then:
>>>>
>>>> ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.35.jar;
>>>>
>>>> CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
>>>>
>>>> select ua_parser('Wikimedia Bot');
>>>>
>>>>
The question was actually about doing UA parsing in MySQL directly, but I
did appreciate the additional information about Hadoop-based options (even
though they don't cover many use cases that will be addressed by storing
parsed data directly in EL tables).
To go off on a tangent for a bit: Accessing EL data in Hadoop is an
interesting topic, but as far as I know it has not been done widely before.
I have been interested in it for a while (mainly because of the performance
issues with some large EL tables in MariaDB) and in January Marcel walked
me through the steps at
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hadoop and we
successfully imported one
partition (one hour's worth of data). But we got stuck at the question of
how to merge separately imported partitions, and I still don't see that
addressed in the documentation. I understand that's a separate problem from
the schema versioning issues discussed in this thread. Also, while the
Spark option sounds cool, it would involve learning an entirely new tool
and workflow for myself and other analysts (IIRC Oliver also made that
point when we discussed this on IRC earlier this year with the team).
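The unresolved merge step might, under some assumptions, be approached like this: treat each imported hourly partition as a list of JSON event lines and deduplicate on the capsule's uuid field. The helper name and data shape below are hypothetical, not the documented workflow:

```python
import json

def merge_partitions(partitions):
    """Merge hourly EL partitions into one event list, dropping duplicates.

    Hypothetical sketch of the missing merge step: each partition is a
    list of JSON event strings, and events are assumed to carry a unique
    'uuid' field (which the EL capsule does provide).
    """
    seen = set()
    merged = []
    for partition in partitions:
        for line in partition:
            event = json.loads(line)
            if event['uuid'] not in seen:
                seen.add(event['uuid'])
                merged.append(event)
    return merged

# An event duplicated across two hourly imports is kept only once:
hour1 = ['{"uuid": "a", "event": {}}', '{"uuid": "b", "event": {}}']
hour2 = ['{"uuid": "b", "event": {}}', '{"uuid": "c", "event": {}}']
print(len(merge_partitions([hour1, hour2])))  # 3
```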
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB