Thanks to all who responded in this thread!
I have now filed a Phabricator task for augmenting the EventLogging capsule with this kind of pre-parsed data alongside the existing raw user agent field: https://phabricator.wikimedia.org/T153207 (Also, as Nuria recalled on Phabricator at the time, this would in addition help to address the open issue of filtering out spiders in EL: https://phabricator.wikimedia.org/T121550 )
Back in September I ended up continuing to use an rather simplistic regex in MySQL for the task at hand (restricting to Firefox UAs to mitigate T146840), but the experience only confirmed that it would be much better to have browser family etc. detected by the ua-parser library.
On Thu, Sep 15, 2016 at 10:37 PM, Nuria Ruiz nuria@wikimedia.org wrote:
I think we can also probably consider doing the parsing in EL/MySQL so the user agent is never raw on tables but rather always parsed. We could use the python ua parser library and results should be identical to the ones we have on Hive.
Thanks,
Nuria
On Thu, Sep 15, 2016 at 1:06 PM, Andrew Otto otto@wikimedia.org wrote:
I’ve added an example to https://wikitech.wikimedia. org/wiki/Analytics/EventLogging#Hive on how to use the UAParserUDF and the Hive get_json_object function to work with a user_agent_map.
Unfortunately we can’t manage tables in Hive for every EventLogging schema/revision like we do in MySQL. So, you have to create your own table. It *should* be possible to specify the schema and use the org.apache.hive.hcatalog.data.JsonSerDe, but I haven’t tried this.
Hope that helps!
On Thu, Sep 15, 2016 at 3:19 PM, Marcel Ruiz Forns mforns@wikimedia.org wrote:
Just a heads up:
user_agent field is a PII field (privacy sensitive), and as such it is purged after 90 days. If there would be a user_agent_map field, it should be purged after 90 days as well.
Another more permanent option might be to detect the browser family on the JavaScript client with i.e. duck-typing[1] and send it as part of the explicit schema. The browser family by itself is not identifying enough to be considered PII, and could be kept indefinitely.
[1] http://stackoverflow.com/questions/9847580/how-to-detect -safari-chrome-ie-firefox-and-opera-browser
On Thu, Sep 15, 2016 at 5:40 PM, Jane Darnell jane023@gmail.com wrote:
It's not just a question of which value to choose, but also how to sort. It would be nice to be able to choose sorting in alphabetical order vs numerical order. It would also be nice to assign a default sort to any item label that is taken from the Wikipedia {{DEFAULTSORT}} template (though that won't work for items without a Wikipedia article).
On Thu, Sep 15, 2016 at 10:18 AM, Dan Andreescu < dandreescu@wikimedia.org> wrote:
The problem with working on EL data in hive is that the schemas for the tables can change at any point, in backwards-incompatible ways. And maintaining tables dynamically is harder here than in mysql world (where EL just tries to insert, and creates the table on failure). So, while it's relatively easy to use ua-parser (see below), you can't easily access EL data in hive tables. However, we do have all EL data in hadoop, so you can access it with Spark. Andrew's about to answer with more details on that. I just thought this might be useful if you sqoop EL data from mysql or otherwise import it into a Hive table:
from stat1002, start hive, then:
ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/a nalytics/refinery/refinery-hive-0.0.35.jar;
CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refin ery.hive.UAParserUDF';
select ua_parser('Wikimedia Bot');
The question was actually about doing UA parsing in MySQL directly, but I did appreciate the additional information about Hadoop-based options (even though they don't cover many use cases that will be addressed by storing parsed data directly in EL tables). To go off on a tangent for a bit: Accessing EL data in Hadoop is an interesting topic, but as far as I know it has not been done widely before. I have been interested in it for a while (maInly because of the performance issues with some large EL tables in MariaDB) and in January Marcel walked me through the steps at https://wikitech.wikimedia.org /wiki/Analytics/EventLogging#Hadoop and we successfully imported one partition (one hour's worth of data). But we got stuck at the question of how to merge separately imported partitions, and I still don't see that addressed in the documentation. I understand that's a separate problem from the schema versioning issues discussed in this thread. Also, while the Spark option sounds cool, it would involve learning an entirely new tool and workflow for myself and other analysts (IIRC Oliver also made that point when we discussed this on IRC earlier this year with the team).