Hi all,
The webrequest and pageview_hourly tables on Hive contain the very useful user_agent_map field, which stores the following data extracted from the raw user agent (still available as a separate field): device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version. (The Analytics Engineering team has built a dashboard that uses this data and last month published a popular blog post about it.) I understand it is mainly based on the ua-parser library (http://www.uaparser.org/).
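For reference, this is roughly how that pre-parsed map gets used on the cluster side (a sketch only; the partition values are just an example):

-- Example: pageviews by browser for one day, taken from the pre-parsed map.
SELECT
  user_agent_map['browser_family'] AS browser_family,
  user_agent_map['browser_major']  AS browser_major,
  SUM(view_count)                  AS views
FROM wmf.pageview_hourly
WHERE year = 2016 AND month = 9 AND day = 14
GROUP BY user_agent_map['browser_family'], user_agent_map['browser_major']
ORDER BY views DESC
LIMIT 20;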
In contrast, the event capsule in our EventLogging tables only contains the raw, unparsed user agent.
* Does anyone on this list have experience in parsing user agents in EventLogging data for the purpose of detecting browser family, version etc., and would like to share advice on how to do this most efficiently? (In the past, I have written some expressions in MySQL to extract the app version number for the Wikipedia apps; a rough sketch of that kind of expression follows at the end of this message. But it seems a bit of a pain to do that for classifying browsers in general. One option would be to export the data and use the Python version of ua-parser; however, doing it directly in MySQL would fit better into existing workflows.)
* Assuming it is technically possible to add such a pre-parsed user_agent_map field to the EventLogging tables, would other analysts be interested in using it too?
This came up recently with the Reading web team, for the purpose of investigating whether certain issues are caused by certain browsers only. But I imagine it has arisen in other places as well.
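To illustrate the MySQL approach mentioned above, here is a rough sketch; the table name is only a placeholder for an actual EventLogging table in the log database, and the user agent format shown in the comment is approximate:

-- Rough sketch only: extract the app version from userAgent values that look
-- like "WikipediaApp/2.4.160-... (Android 6.0.1; Phone)".
-- log.SomeSchema_12345678 is a placeholder table name.
SELECT
  SUBSTRING_INDEX(SUBSTRING_INDEX(userAgent, 'WikipediaApp/', -1), ' ', 1) AS app_version,
  COUNT(*) AS events
FROM log.SomeSchema_12345678
WHERE userAgent LIKE '%WikipediaApp/%'
GROUP BY app_version
ORDER BY events DESC;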
Tilman Bayer, 15/09/2016 01:21:
This came up recently with the Reading web team, for the purpose of investigating whether certain issues are caused by certain browsers only. But I imagine it has arisen in other places as well.
Definitely. https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
Nemo
On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
The problem with working on EL data in Hive is that the schemas for the tables can change at any point, in backwards-incompatible ways, and maintaining tables dynamically is harder there than in the MySQL world (where EL just tries to insert, and creates the table on failure). So, while it's relatively easy to use ua-parser (see below), you can't easily access EL data in Hive tables. However, we do have all EL data in Hadoop, so you can access it with Spark; Andrew's about to answer with more details on that. I just thought this might be useful if you sqoop EL data from MySQL or otherwise import it into a Hive table:
From stat1002, start hive, then:
ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.35.jar;
CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
select ua_parser('Wikimedia Bot');
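As a sketch of applying this to actual data rather than a literal (the table and column names are placeholders for an imported EventLogging table, and this assumes the UDF returns a map keyed like webrequest's user_agent_map):

-- Sketch: parse a raw userAgent column and break events down by browser.
-- tbayer.el_imported is a placeholder for an imported EventLogging table.
SELECT
  ua['browser_family'] AS browser_family,
  ua['browser_major']  AS browser_major,
  COUNT(*)             AS events
FROM (
  SELECT ua_parser(useragent) AS ua
  FROM tbayer.el_imported
) parsed
GROUP BY ua['browser_family'], ua['browser_major']
ORDER BY events DESC;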
On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell jane023@gmail.com wrote:
It's not just a question of which value to choose, but also how to sort. It would be nice to be able to choose sorting in alphabetical order vs numerical order. It would also be nice to assign a default sort to any item label that is taken from the Wikipedia {{DEFAULTSORT}} template (though that won't work for items without a Wikipedia article).
On Thu, Sep 15, 2016 at 3:19 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Just a heads up:
The user_agent field is a PII field (privacy sensitive), and as such it is purged after 90 days. If there were a user_agent_map field, it would have to be purged after 90 days as well.
Another, more permanent option might be to detect the browser family on the JavaScript client, e.g. with duck-typing [1], and send it as part of the explicit schema. The browser family by itself is not identifying enough to be considered PII, and could be kept indefinitely.
[1] http://stackoverflow.com/questions/9847580/how-to-detect-safari-chrome-ie-firefox-and-opera-browser
On Thu, Sep 15, 2016 at 1:06 PM, Andrew Otto otto@wikimedia.org wrote:
I’ve added an example to https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive on how to use the UAParserUDF and the Hive get_json_object function to work with a user_agent_map.
Unfortunately we can’t manage tables in Hive for every EventLogging schema/revision like we do in MySQL. So, you have to create your own table. It *should* be possible to specify the schema and use the org.apache.hive.hcatalog.data.JsonSerDe, but I haven’t tried this.
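To make the JsonSerDe idea a bit more concrete, an untested sketch (the database name, table name and HDFS location are placeholders, and only part of the capsule schema is declared):

-- Untested sketch of the JsonSerDe approach: declare (part of) the EventLogging
-- capsule schema and let the SerDe map JSON fields onto columns.
-- The hive-hcatalog-core jar that provides this SerDe may need to be added first.
CREATE EXTERNAL TABLE tbayer.el_sample (
  uuid      STRING,
  wiki      STRING,
  webHost   STRING,
  userAgent STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/tbayer/el_sample';

-- With such a table in place, the UDF from earlier in the thread can be applied directly:
SELECT ua_parser(userAgent)['browser_family'] AS browser_family, COUNT(*) AS events
FROM tbayer.el_sample
GROUP BY ua_parser(userAgent)['browser_family'];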
Hope that helps!
On Thu, Sep 15, 2016 at 10:37 PM, Nuria Ruiz nuria@wikimedia.org wrote:
I think we can also probably consider doing the parsing in EL/MySQL, so that the user agent is never stored raw in the tables but rather always parsed. We could use the Python ua-parser library, and the results should be identical to the ones we have on Hive.
Thanks,
Nuria
Thanks to all who responded in this thread!
I have now filed a Phabricator task for augmenting the EventLogging capsule with this kind of pre-parsed data alongside the existing raw user agent field: https://phabricator.wikimedia.org/T153207 (Also, as Nuria recalled on Phabricator at the time, this would additionally help address the open issue of filtering out spiders in EL: https://phabricator.wikimedia.org/T121550 .)
Back in September I ended up continuing to use a rather simplistic regex in MySQL for the task at hand (restricting to Firefox UAs to mitigate T146840), but the experience only confirmed that it would be much better to have browser family etc. detected by the ua-parser library.
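For illustration only (this is not the actual expression used, and the table name is a placeholder), that kind of query looked roughly like:

-- Illustrative sketch, not the actual query: restrict to Firefox UAs and
-- bucket by major version. Even this needs care; e.g. SeaMonkey UAs also
-- contain a "Firefox/" token.
SELECT
  SUBSTRING_INDEX(SUBSTRING_INDEX(userAgent, 'Firefox/', -1), '.', 1) AS firefox_major,
  COUNT(*) AS events
FROM log.SomeSchema_12345678
WHERE userAgent REGEXP 'Firefox/[0-9]'
  AND userAgent NOT LIKE '%SeaMonkey%'
GROUP BY firefox_major
ORDER BY events DESC;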
The question was actually about doing UA parsing in MySQL directly, but I did appreciate the additional information about Hadoop-based options (even though they don't cover many use cases that will be addressed by storing parsed data directly in EL tables).
To go off on a tangent for a bit: accessing EL data in Hadoop is an interesting topic, but as far as I know it has not been done widely before. I have been interested in it for a while (mainly because of the performance issues with some large EL tables in MariaDB), and in January Marcel walked me through the steps at https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hadoop and we successfully imported one partition (one hour's worth of data). But we got stuck at the question of how to merge separately imported partitions, and I still don't see that addressed in the documentation. I understand that's a separate problem from the schema versioning issues discussed in this thread.
Also, while the Spark option sounds cool, it would involve learning an entirely new tool and workflow for myself and other analysts (IIRC Oliver also made that point when we discussed this on IRC with the team earlier this year).