Hi everyone,
We now support querying EventLogging data through Hive tables!
This has been a long project https://phabricator.wikimedia.org/T162610. We finally feel comfortable enough to announce support for this method of querying EventLogging data. You can read documentation on how to access this data here:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hadoop_&a...
The ‘event’ database in Hive now contains tables for most EventLogging schemas, including both ‘analytics’ schemas https://meta.wikimedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=470 and some of the ‘EventBus’ schemas https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema .
Since Hive is a strongly typed system, there are limitations on what data can be imported from JSON. The job that imports this data https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/JsonRefine.scala does a bit of magic to infer field types from the data itself. If your data is ever produced with a field that has multiple types (e.g. string vs. object, integer vs. float, etc.), the import will fail for the entire hour containing the discrepancy. Please be careful when designing your schemas and writing the code that emits events. We’ve recently put together some draft guidelines for new schemas https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines. Keep these in mind when you design new schemas :)
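To make the failure mode concrete, here is a simplified Python sketch (not the actual JsonRefine logic, which is Scala; the function names here are purely illustrative) of how inferring one type per field across a batch of JSON events breaks down when a field's type varies:

```python
def infer_type(value):
    """Map a JSON value to a rough Hive-like type name."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return "null"

def infer_schema(events):
    """Infer a single type per field across all events; raise on any conflict."""
    schema = {}
    for event in events:
        for field, value in event.items():
            t = infer_type(value)
            if field in schema and schema[field] != t:
                raise ValueError(
                    f"Type conflict for field '{field}': "
                    f"{schema[field]} vs {t} -- the whole batch fails to import"
                )
            schema[field] = t
    return schema

# Events with consistent field types infer a clean schema ...
good = [{"userId": 1, "action": "click"}, {"userId": 2, "action": "hover"}]
print(infer_schema(good))  # {'userId': 'bigint', 'action': 'string'}

# ... but one event emitting 'userId' as a string breaks the whole batch,
# analogous to a whole hour of data failing to import.
bad = good + [{"userId": "3", "action": "click"}]
try:
    infer_schema(bad)
except ValueError as e:
    print(e)
```

This is why a single misbehaving client emitting a field with the wrong type can block the import for everyone's data in that hour.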
This is brand new and could be buggy! Please let us know if you encounter problems while using this data.
- Your friendly Analytics engineering team