Hi everyone,
We now support querying EventLogging data from Hive tables!
This has been a long project <https://phabricator.wikimedia.org/T162610>. We
finally feel comfortable enough to announce support for this method of
querying EventLogging data. You can read documentation on how to access
this data here:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hadoop_&…
The ‘event’ database in Hive now contains tables for most EventLogging
schemas, including both ‘analytics’ schemas
<https://meta.wikimedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=470>
and some of the ‘EventBus’ schemas
<https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema>.
Since Hive is a strongly typed system, there are limitations to what data
can be imported from JSON. The job that imports this data
<https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/JsonRefine.scala>
does a bit of magic to infer the field types from the data itself. If your
data is ever produced with a field that has multiple types (e.g. string vs.
object, or integer vs. float), the import for the entire hour containing the
discrepancy will fail. Please be careful when designing your schemas and
writing the code that emits events. We’ve recently been
putting together some draft guidelines for new schemas
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines>.
Keep these in mind when you design new schemas :)
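Since mixed types can be easy to miss, here is a minimal sketch (in Python, purely for illustration; this is not how the actual JsonRefine job works) of how you might detect inconsistent field types in a batch of JSON events before they break an hourly import:

```python
import json

def field_types(events):
    """Map each top-level field name to the set of JSON type names seen for it."""
    types = {}
    for event in events:
        for field, value in json.loads(event).items():
            types.setdefault(field, set()).add(type(value).__name__)
    return types

def inconsistent_fields(events):
    """Return fields that appear with more than one type across the batch."""
    return {f: t for f, t in field_types(events).items() if len(t) > 1}

# Two events where 'duration' is emitted as an integer in one and a string
# in the other -- the kind of discrepancy that would cause the whole hour's
# import to fail.
events = [
    '{"duration": 42, "page": "Main_Page"}',
    '{"duration": "42", "page": "Help:Contents"}',
]
print(inconsistent_fields(events))  # 'duration' was seen as both int and str
```

A check along these lines, run against a sample of your events in tests, is one way to catch type drift in your instrumentation before it reaches the import job.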
This is new and may still have bugs! Please let us know if you
encounter problems while using this data.
- Your friendly Analytics engineering team