Hi all,
the webrequest and pageview_hourly tables on Hive contain the very
useful user_agent_map field, which stores the following data extracted
from the raw user agent (still available as a separate field):
device_family, browser_family, browser_major, os_family, os_major,
os_minor and wmf_app_version. (The Analytics Engineering team has
built a dashboard that uses this data and last month published a
popular blog post about it.) I understand it is mainly based on the
ua-parser library (
http://www.uaparser.org/ ) .
In contrast, the event capsule in our EventLogging tables only
contains the raw, unparsed user agent.
* Does anyone on this list have experience in parsing user agents in
EventLogging data for the purpose of detecting browser family, version
etc, and would like to share advice on how to do this most
efficiently? (In the past, I have written some expressions in MySQL to
extract the app version number for the Wikipedia apps. But it seems a
bit of a pain to do that for classifying browsers in general. One
option would be to export the data and use the Python version of
ua-parser, however doing it directly in MySQL would fit better into
existing workflows.)
* Assuming it is technically possible to add such a pre-parsed
user_agent_map field to the EventLogging tables, would other analysts
be interested in using it too?
This came up recently with the Reading web team, for the purpose of
investigating whether certain issues are caused by certain browsers
only. But I imagine it has arisen in other places as well.
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB