Hi,
in the week from 2014-11-17–2014-11-23 Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics related Ops:
* EventLogging hit throughput limit to database
* Unintended EventLogging deploy of faulty code
* Outage on master of EventLogging's database shard (db1020)
* Outage on master of EventLogging's database shard (db1046)
* Debugging Mobile UI dashboard
* Upgrades of first machines from the cluster to trusty
* Discussions with researchers on how they could take advantage of the cluster
* Allow multiple varnishkafkas on caches

(details below)
Have fun, Christian
* EventLogging throughput limit to database
One of the teams instrumenting via EventLogging silently (and drastically ;-)) increased the volume of events they produce [1]. The total volume of events that the EventLogging infrastructure had to handle jumped from ~140 msgs/s to ~220 msgs/s. That was more than EventLogging's database writer could push to the database, so only ~70% of events made it there. We isolated the issue and overcame the throughput limitation of EventLogging's database writer; the database writer (not EventLogging as a whole) can now handle far more events.
In addition to that, the database got backfilled from the plain-file logs.
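For illustration only (the actual fix lived inside EventLogging's database writer, whose code is not shown here; the table name, schema, and batch size below are made up): a common way to raise a writer's insert throughput is to buffer events and flush them in batches rather than committing one row at a time. A minimal sketch, using Python's built-in sqlite3 as a stand-in database:

```python
import sqlite3

# Hypothetical stand-in for a database writer: instead of one INSERT and
# one commit per event, buffer events and flush them with executemany().
BATCH_SIZE = 500  # made-up value; the real writer's batching may differ

def write_events(conn, events, batch_size=BATCH_SIZE):
    """Insert (uuid, timestamp, payload) tuples in batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            conn.executemany(
                "INSERT INTO events (uuid, timestamp, payload) VALUES (?, ?, ?)",
                batch,
            )
            conn.commit()  # one commit per batch, not per event
            batch = []
    if batch:  # flush the remainder
        conn.executemany(
            "INSERT INTO events (uuid, timestamp, payload) VALUES (?, ?, ?)",
            batch,
        )
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (uuid TEXT, timestamp INTEGER, payload TEXT)")
write_events(conn, ((f"id-{i}", 1416182400 + i, "{}") for i in range(1200)))
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 1200
```

The win comes from amortizing per-statement and per-commit overhead across many rows; the trade-off is that a crash can lose up to one in-flight batch, which is exactly why the plain-file logs matter for backfilling.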
* Unintended EventLogging deploy of faulty code
It seems that during efforts to bring EventLogging up to date on the beta cluster, faulty code got unintentionally deployed to production [2]. With this faulty code, EventLogging's database writer crashed several times. Known-good code was deployed again, and the database got back-filled from the plain-file logs.
* Outage on master of EventLogging's database shard (db1020)
m2-master's mysqld process aborted [3], and hence EventLogging had no database to write to. Ops quickly failed over to the slave db1046, which addressed the issue, and we backfilled the EventLogging database from the plain-file logs.
* Outage on master of EventLogging's database shard (db1046)
Shortly after m2-master got failed over to db1046 (see above), db1046 had issues around its threadpool [4]. EventLogging could not connect to the database, and consequently could not write events to it. Ops quickly fixed the issue, and we backfilled the EventLogging database from plain-file logs.
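The backfills mentioned in the items above amount to replaying the plain-file logs into the database while skipping events that already made it before the outage. A minimal sketch of that idea (the JSON log format, `uuid` field, and sqlite stand-in are illustrative assumptions, not EventLogging's actual schema or code):

```python
import json
import sqlite3

def backfill(conn, log_lines):
    """Replay JSON-encoded events from a plain-file log, skipping duplicates."""
    inserted = 0
    for line in log_lines:
        event = json.loads(line)
        # INSERT OR IGNORE relies on the uuid primary key to skip events
        # that already reached the database before the outage.
        cur = conn.execute(
            "INSERT OR IGNORE INTO events (uuid, payload) VALUES (?, ?)",
            (event["uuid"], line),
        )
        inserted += cur.rowcount
    conn.commit()
    return inserted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (uuid TEXT PRIMARY KEY, payload TEXT)")
# Pretend events id-0 and id-1 made it to the database before the outage:
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("id-0", "{}"), ("id-1", "{}")])
log = [json.dumps({"uuid": f"id-{i}"}) for i in range(5)]
print(backfill(conn, log))  # only the 3 missing events get inserted
```

Making the replay idempotent like this is what lets the same logs be replayed after each of the incidents above without creating duplicate rows.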
* Debugging Mobile UI dashboard
The mobile UI dashboard was having issues, and since it is based on EventLogging data, people assumed the dashboard issues were caused by EventLogging's issues. We helped debug the dashboard and pointed people to the real cause; EventLogging was not the culprit. Regardless, the relevant graph [5] is working again.
* Upgrades of first machines from the cluster to trusty
After the first efforts to upgrade the Analytics cluster to trusty during the previous week, the analytics1003 Cisco box no longer ran reliably over the weekend [6]. There were kernel panics; it is not yet fully clear what is going on, as the panics seem to occur even when the machine is not running any services.
analytics1033’s management interface is not working properly. It will be upgraded once this is fixed.
* Discussions with researchers on how they could take advantage of the cluster
With the increasing amount of data available, researchers are running into the question of how to query the data without consuming too many resources. So discussions were started on how researchers can make use of the cluster, and, for example, how to use kafkatee instead of udp2log.
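Part of what makes both udp2log and kafkatee resource-friendly for consumers is 1-in-n sampling of the request stream; the advantage of kafkatee is that it reads from Kafka (durable and replayable) rather than from lossy sampled UDP. The sampling idea itself, stripped of all Kafka specifics, can be sketched in a few lines (the stream and names below are toy stand-ins, not either tool's actual implementation):

```python
def sample(lines, n):
    """Yield every n-th line, in the style of udp2log/kafkatee 1-in-n sampling."""
    for i, line in enumerate(lines):
        if i % n == 0:
            yield line

# Toy stand-in for a request log stream; real streams come from Kafka topics.
stream = (f"request-{i}" for i in range(1000))
sampled = list(sample(stream, 100))
print(len(sampled))  # 1-in-100 sampling of 1000 lines -> 10 lines
```

A researcher querying a 1-in-100 sample then scales counts back up by the sampling factor, trading precision for a hundredfold reduction in data handled.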
* Allow multiple varnishkafkas on caches
Up to now, only a single varnishkafka instance has been running on the caches. But in order to also feed performance data into Kafka, a second varnishkafka instance on the caches would help. Together with Ori, work was done to allow running multiple varnishkafka instances on the caches. Look out for statsv :-)
[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141114-EventLog... (The date in the incident report is from the previous week. Nonetheless, it's correct that we only started to work on it this week, as we only noticed while hunting down https://lists.wikimedia.org/pipermail/analytics/2014-November/002798.html . It is known that EventLogging monitoring has some holes. Closing some of them has been on the agenda for some time, and we also added it to the actionables on the incident report.)
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141118-EventLog...
[3] Sadly no public incident report about the database incident. Only on the non-public ops list:
https://lists.wikimedia.org/mailman/private/ops/2014-November/043964.html
[4] Sadly no public incident report about the database incident. Only on the non-public ops list:
https://lists.wikimedia.org/mailman/private/ops/2014-November/044167.html
[5] http://mobile-reportcard.wmflabs.org/graphs/ui-daily
[6] https://phabricator.wikimedia.org/T1200