Hi,
apologies for the long pause since the last update.
In the week of 2014-11-10 to 2014-11-16, Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics-related Ops:
* Talk about Kafka at Apache Kafka NYC user group
* High-availability test for EventLogging's database failed
* Upgrades of first machines from the cluster to trusty
(details below)
Have fun, Christian
* Talk about Kafka at Apache Kafka NYC user group
Andrew gave a talk [1] about WMF's Kafka setup and the challenges around it at the Apache Kafka NYC user group. That ended up bringing in great feedback, not only on the talk itself but also on instrumenting Python within the Hadoop ecosystem. So it helped in more than one way :-)
* High-availability test for EventLogging's database failed
Ops are in the process of moving the database that EventLogging writes to behind a high-availability proxy. A test for that failed [2] (a firewall got in the way), and EventLogging could not write events to the database for ~20 minutes.
Ops fixed the firewalling, and we backfilled the database from the plain-file logs.
* Upgrades of first machines from the cluster to trusty
The first few machines got upgraded to trusty [3]. At first things were looking good, with only a minor issue with grub that could be worked around. During that week, things looked mostly smooth for the trusty upgrade.
[1] A Google Glass :-) recorded video of the first 45 minutes of the talk is at:
https://drive.google.com/folderview?id=0B0B2VcpkcY6wVFR3TFhIVEl5dW8&usp=...
(downloadable for everyone who is signed in to Google :-/ If you know how to reformat that URL into a plain curl-able URL, please let me know) (I first thought that the video was missing audio, but the audio is there. It's just very quiet.)
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141113-EventLog...
[3] https://phabricator.wikimedia.org/T1200