Hi,
in the week from 2014-11-03 to 2014-11-09, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* More research around making xmldumps available in the Analytics cluster
* 'research' database user
* Alter data type of time_firstbyte, and adding Range header to webrequest table
* Automatic cleanup of EventLogging logs on stat1002 and stat1003
* Per wiki CSVs with daily aggregates of webstatscollector numbers
(details below)
Have fun, Christian
* More research around making xmldumps available in the Analytics cluster
In order to make the xmldumps easily accessible from within the cluster, more research around WikiHadoop as InputFormat and Avro as serialization format was done. The proof of concept allowed us to stream the xmldumps through WikiHadoop and write them into Avro files. This approach would chunk the xmldumps into records containing the text (and metadata) for a revision and its parent revision. Those records could be consumed directly from the cluster's processing platforms.
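The record shape described above can be sketched roughly as follows. This is only an illustration in plain Python, not the proof-of-concept code: the `Revision` and `RevisionPair` types and their field names are made up for this example, and it assumes that the revisions of a page arrive in order, so that the preceding revision serves as the parent.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Revision:
    # Hypothetical, minimal subset of the metadata a dump revision carries.
    rev_id: int
    timestamp: str
    text: str


@dataclass
class RevisionPair:
    revision: Revision
    parent: Optional[Revision]  # None for the first revision of a page


def pair_with_parent(revisions: Iterable[Revision]) -> Iterator[RevisionPair]:
    """Yield one record per revision, carrying the preceding revision as parent.

    Assumes the revisions belong to a single page and are already ordered.
    """
    parent = None
    for rev in revisions:
        yield RevisionPair(revision=rev, parent=parent)
        parent = rev
```

A consumer that wants to diff a revision against its parent would then only need a single record, without any further lookup.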
* 'research' database user
The code needed to change the password of the 'research' database user got merged, so future password changes should involve less friction. And the password was finally changed :-)
* Alter data type of time_firstbyte, and adding Range header to webrequest table
To be able to (at least mostly) disambiguate “seeking within a video file” from “starting to watch a video file” in the logs, we needed to add the Range header to the webrequest table.
Additionally, the data type of the time_firstbyte column needed to be changed.
While both migrations should just work according to Hive's documentation, testing them beforehand in labs showed that they “sometimes” screw up the table. So we prepared scripts to resurrect the table in case the migration blew it up in the Analytics cluster.
We migrated the table. The table exploded (Bug 73095). And the prepared scripts helped to rebuild it within a few minutes.
Now, webrequest has the needed Range header, a more granular data type for time_firstbyte, and all the partitions re-added.
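To illustrate how the Range header helps with the disambiguation mentioned above, here is a minimal sketch in Python. It is not code that runs on the cluster. The function name `classify_range`, the labels it returns, and the treatment of "-" as a placeholder for a missing header are assumptions made for this example. The idea is simply that a request without a Range header, or with a range starting at byte 0, looks like someone starting to watch, while a range starting at a later offset looks like seeking.

```python
import re
from typing import Optional

# Matches a single byte range such as "bytes=0-" or "bytes=1048576-2097151".
_RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")


def classify_range(range_header: Optional[str]) -> str:
    """Return "start", "seek", or "unknown" for a logged Range header value."""
    # Assumption: a missing header is logged as None, "" or "-".
    if not range_header or range_header == "-":
        return "start"
    match = _RANGE_RE.match(range_header.strip())
    if match is None:
        # Suffix ranges ("bytes=-500") and multi-range requests end up here.
        return "unknown"
    return "start" if int(match.group(1)) == 0 else "seek"
```

As noted above, this can only "mostly" disambiguate: a player that resumes playback at a stored position would also be classified as seeking.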
* Automatic cleanup of EventLogging logs on stat1002 and stat1003
File logs of EventLogging data on stat1002 and stat1003 are now automatically cleaned up after 90 days, as required by the data retention guidelines.
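For illustration, a cleanup of this kind could look roughly like the sketch below. This is not the deployed job. The function names are made up for this example, and it assumes that a file's age is judged by its modification time and that all logs sit directly in one directory.

```python
import time
from pathlib import Path

RETENTION_DAYS = 90  # retention period named above


def find_expired(log_dir, now=None, retention_days=RETENTION_DAYS):
    """Return the files directly in log_dir whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    return sorted(
        path
        for path in Path(log_dir).iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def cleanup(log_dir, dry_run=True):
    """Delete expired files; with dry_run=True only report what would be deleted."""
    expired = find_expired(log_dir)
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired
```

Defaulting to a dry run is a deliberate choice in this sketch, so that a mistyped directory does not delete anything by accident.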
* Per wiki CSVs with daily aggregates of webstatscollector numbers
In order to ease upcoming plotting of webstatscollector data in Dashiki, we wrote code to automatically aggregate webstatscollector's hourly projectcounts files into per wiki CSVs with daily numbers. And we backfilled the data back to 2008. The code and data are still under code review.
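The aggregation step can be sketched as follows. This is not the aggregator code under review, only a minimal illustration. It assumes that each line of an hourly projectcounts file has the form "<wiki> - <count> <bytes>" and that the caller already knows which date each hourly file belongs to. The function names and the CSV column names are made up for this example.

```python
import csv
import io
from collections import defaultdict


def aggregate_daily(hourly_files):
    """Sum hourly counts into daily totals.

    hourly_files: iterable of (date, lines) pairs, one pair per hourly file.
    Returns a nested mapping {wiki: {date: total_count}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for date, lines in hourly_files:
        for line in lines:
            fields = line.split()
            # Assumed line format: "<wiki> - <count> <bytes>"; skip anything else.
            if len(fields) != 4 or not fields[2].isdigit():
                continue
            totals[fields[0]][date] += int(fields[2])
    return totals


def to_csv(daily):
    """Render one wiki's {date: total} mapping as CSV text, sorted by date."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "count"])
    for date in sorted(daily):
        writer.writerow([date, daily[date]])
    return out.getvalue()
```

Writing `to_csv(totals[wiki])` to one file per wiki would then give the per wiki CSVs with daily numbers.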
Christian Aistleitner, 13/11/2014 00:59:
- Automatic cleanup of EventLogging logs on stat1002 and stat1003
File logs of EventLogging data on stat1002 and stat1003 are now automatically cleaned up after 90 days, as required by the data retention guidelines.
Great!
- Per wiki CSVs with daily aggregates of webstatscollector numbers
In order to ease upcoming plotting of webstatscollector data in Dashiki, we wrote code to automatically aggregate webstatscollector's hourly projectcounts files into per wiki CSVs with daily numbers. And we backfilled the data back to 2008. The code and data are still under code review.
Interesting.
https://gerrit.wikimedia.org/r/#/q/project:analytics/aggregator+OR+project:analytics/aggregator/data,n,z
https://bugzilla.wikimedia.org/show_bug.cgi?id=72740
Nemo