Hi,
In the week from 2014-11-24 to 2014-11-30, Andrew and I [1] worked on the following items around the Analytics Cluster and Analytics related Ops:
* Catch-up and meetings around EventLogging issues.
* EventLogging's database writer not properly shutting down
* Wikipedia Zero graph comparability
* Network switch outage in eqiad
(details below)
Have fun, Christian
* Catch-up and meetings around EventLogging issues.
There were quite a few catch-up discussions and meetings around the recent EventLogging issues. It seems we are all on the same page now.
* EventLogging's database writer not properly shutting down
When we had to increase EventLogging's database throughput ad hoc, the hot fix was known to come with exit synchronization that was not very robust. So in case of issues with the events, the database writer would not properly shut down and restart, but could be left hanging. This was known beforehand, and was accepted in order to bring EventLogging up again as soon as possible.
The fix for it is not hard, but with the many follow-up meetings, it did not get deployed before the issue first struck [2]. Now that the follow-up meetings are done, the fix has been reviewed and deployed, and it has been working fine so far.
We backfilled the database from plain-file logs for the affected period.
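For readers unfamiliar with the term, here is a minimal sketch of what robust exit synchronization for a queue-fed writer thread can look like. It is for illustration only and is not EventLogging's actual code: the names BatchWriter, flush and _STOP are made up for this example, and flush stands in for the real database insert.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel telling the writer to exit


class BatchWriter(threading.Thread):
    """Illustrative batching writer; `flush` stands in for a database insert."""

    def __init__(self, flush, batch_size=3):
        super().__init__(daemon=True)
        self.events = queue.Queue()
        self.flush = flush
        self.batch_size = batch_size
        self.failed = threading.Event()  # set when the writer dies on an error

    def run(self):
        batch = []
        try:
            while True:
                event = self.events.get()
                if event is _STOP:
                    break
                batch.append(event)
                if len(batch) >= self.batch_size:
                    self.flush(batch)
                    batch = []
            if batch:
                self.flush(batch)  # write the remainder on a clean shutdown
        except Exception:
            # Record the failure so the caller can restart the writer
            # instead of waiting on a thread that will never make progress.
            self.failed.set()

    def stop(self, timeout=5.0):
        """Request shutdown; return True if the thread actually exited."""
        self.events.put(_STOP)
        self.join(timeout)
        return not self.is_alive()


# Usage: five events, batches of three, then a clean shutdown.
written = []
writer = BatchWriter(written.extend)
writer.start()
for i in range(5):
    writer.events.put({"id": i})
stopped = writer.stop()
```

The point of the sketch is that both exit paths (a requested stop and a failing insert) end the thread, and the caller can tell them apart via failed, so a supervisor can decide whether to restart the writer.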
* Wikipedia Zero graph comparability
Wikipedia Zero is moving from the Analytics team's dashboards to on-wiki graphs on the (private) zerowiki. But the numbers on the graphs did not match. So we helped to identify which aspects of the different pageview definitions cause the mismatches in the graphs. It seems that the key differences are now understood.
* Network switch outage in eqiad
During the weekend, a network switch in eqiad went offline [3] and took key machines in the analytics infrastructure offline. We started [4] looking at the affected machines, measuring impact and backfilling. This is not done yet and will take more time.
[1] Jeff will refocus on Ops projects outside the realm of Analytics. Many thanks for your great work on the Analytics cluster and Analytics related Ops!
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141125-EventLog...
[3] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141130-Eqiad-Ra... https://phabricator.wikimedia.org/tag/incident-20141129-network/
[4] https://lists.wikimedia.org/pipermail/analytics/2014-November/002819.html https://lists.wikimedia.org/pipermail/analytics/2014-December/002821.html https://phabricator.wikimedia.org/T76334