Hi,
in the week from 2014-11-17–2014-11-23 Andrew, Jeff, and I worked on the following items around the Analytics Cluster and Analytics related Ops:
* EventLogging hit throughput limit to database
* Unintended EventLogging deploy of faulty code
* Outage on master of EventLogging's database shard (db1020)
* Outage on master of EventLogging's database shard (db1046)
* Debugging Mobile UI dashboard
* Upgrades of first machines from the cluster to trusty
* Discussions with researchers on how they could take advantage of the cluster
* Allow multiple varnishkafkas on caches

(details below)
Have fun, Christian
* EventLogging throughput limit to database
One of the teams instrumenting via EventLogging silently (and drastically ;-)) increased the volume of events they produce [1]. The total volume of events that the EventLogging infrastructure had to handle jumped from ~140 msgs/s to ~220 msgs/s. That was more than EventLogging's database writer could push to the database, so only ~70% of events made it there. We isolated the issue and overcame the throughput limitation of EventLogging's database writer; the database writer (not EventLogging as a whole) can now handle far more events.
In addition to that, the database got backfilled from the plain-file logs.
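For illustration only (the actual fix lived inside EventLogging's database writer, whose code is not shown here; the table name, schema, and batch size below are made up): a common way to raise a writer's insert throughput is to buffer events and flush them in batches rather than committing one row at a time. A minimal sketch, using Python's built-in sqlite3 as a stand-in database:

```python
import sqlite3

# Hypothetical stand-in for a database writer: instead of one INSERT and
# one commit per event, buffer events and flush them with executemany().
BATCH_SIZE = 500  # made-up value; the real writer's batching may differ

def write_events(conn, events, batch_size=BATCH_SIZE):
    """Insert (uuid, timestamp, payload) tuples in batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            conn.executemany(
                "INSERT INTO events (uuid, timestamp, payload) VALUES (?, ?, ?)",
                batch,
            )
            conn.commit()  # one commit per batch, not per event
            batch = []
    if batch:  # flush the remainder
        conn.executemany(
            "INSERT INTO events (uuid, timestamp, payload) VALUES (?, ?, ?)",
            batch,
        )
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (uuid TEXT, timestamp INTEGER, payload TEXT)")
write_events(conn, ((f"id-{i}", 1416182400 + i, "{}") for i in range(1200)))
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 1200
```

The win comes from amortizing per-statement and per-commit overhead across many rows; the trade-off is that a crash can lose up to one in-flight batch, which is exactly why the plain-file logs matter for backfilling.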
* Unintended EventLogging deploy of faulty code
It seems that during efforts to bring EventLogging up to date on the beta cluster, faulty code got unintentionally deployed to production [2]. With this faulty code, EventLogging's database writer crashed several times. Known-good code was deployed again, and the database got back-filled from the plain-file logs.
* Outage on master of EventLogging's database shard (db1020)
m2-master's mysqld process aborted [3], and hence EventLogging had no database to write to. Ops quickly failed over to the slave db1046, which addressed the issue, and we backfilled the EventLogging database from the plain-file logs.
* Outage on master of EventLogging's database shard (db1046)
Shortly after m2-master got failed over to db1046 (see above), db1046 had issues around its threadpool [4]. EventLogging could not connect to the database, and consequently could not write events to it. Ops quickly fixed the issue, and we backfilled the EventLogging database from plain-file logs.
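The backfills mentioned in the items above amount to replaying the plain-file logs into the database while skipping events that already made it before the outage. A minimal sketch of that idea (the JSON log format, `uuid` field, and sqlite stand-in are illustrative assumptions, not EventLogging's actual schema or code):

```python
import json
import sqlite3

def backfill(conn, log_lines):
    """Replay JSON-encoded events from a plain-file log, skipping duplicates."""
    inserted = 0
    for line in log_lines:
        event = json.loads(line)
        # INSERT OR IGNORE relies on the uuid primary key to skip events
        # that already reached the database before the outage.
        cur = conn.execute(
            "INSERT OR IGNORE INTO events (uuid, payload) VALUES (?, ?)",
            (event["uuid"], line),
        )
        inserted += cur.rowcount
    conn.commit()
    return inserted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (uuid TEXT PRIMARY KEY, payload TEXT)")
# Pretend events id-0 and id-1 made it to the database before the outage:
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("id-0", "{}"), ("id-1", "{}")])
log = [json.dumps({"uuid": f"id-{i}"}) for i in range(5)]
print(backfill(conn, log))  # only the 3 missing events get inserted
```

Making the replay idempotent like this is what lets the same logs be replayed after each of the incidents above without creating duplicate rows.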
* Debugging Mobile UI dashboard
The mobile UI dashboard was having issues, and since it is based on EventLogging data, people assumed the dashboard issues were caused by EventLogging's issues. We helped debug the dashboard and pointed people to the real cause; EventLogging was not the culprit. Regardless, the relevant graph [5] is working again.
* Upgrades of first machines from the cluster to trusty
After the first efforts to upgrade the Analytics cluster to trusty during the previous week, the analytics1003 Cisco box no longer ran reliably over the weekend [6]. There were kernel panics; it is not yet fully clear what is going on, as the panics seem to occur even when the machine is not running any services.
analytics1033’s management interface is not working properly. It will be upgraded once this is fixed.
* Discussions with researchers on how they could take advantage of the cluster
With the increasing amount of data available, researchers are running into the question of how to query the data without consuming too many resources. So discussions were started on how researchers can make use of the cluster, and, for example, how to use kafkatee instead of udp2log.
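Part of what makes both udp2log and kafkatee resource-friendly for consumers is 1-in-n sampling of the request stream; the advantage of kafkatee is that it reads from Kafka (durable and replayable) rather than from lossy sampled UDP. The sampling idea itself, stripped of all Kafka specifics, can be sketched in a few lines (the stream and names below are toy stand-ins, not either tool's actual implementation):

```python
def sample(lines, n):
    """Yield every n-th line, in the style of udp2log/kafkatee 1-in-n sampling."""
    for i, line in enumerate(lines):
        if i % n == 0:
            yield line

# Toy stand-in for a request log stream; real streams come from Kafka topics.
stream = (f"request-{i}" for i in range(1000))
sampled = list(sample(stream, 100))
print(len(sampled))  # 1-in-100 sampling of 1000 lines -> 10 lines
```

A researcher querying a 1-in-100 sample then scales counts back up by the sampling factor, trading precision for a hundredfold reduction in data handled.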
* Allow multiple varnishkafkas on caches
Up to now, only a single varnishkafka instance has been running on the caches. But in order to also feed performance data into Kafka, a second varnishkafka instance on the caches would help. Together with Ori, work was done to allow running multiple varnishkafka instances on the caches. Look out for statsv :-)
[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141114-EventLog... (The date in the incident report is from the previous week. Nonetheless, it's correct that we only started to work on it this week, as we only noticed while hunting down https://lists.wikimedia.org/pipermail/analytics/2014-November/002798.html . It is known that EventLogging monitoring has some holes. Closing some of them has been on the agenda for some time, and we also added it to the actionables on the incident report.)
[2] https://wikitech.wikimedia.org/wiki/Incident_documentation/20141118-EventLog...
[3] Sadly no public incident report about the database incident. Only on the non-public ops list:
https://lists.wikimedia.org/mailman/private/ops/2014-November/043964.html
[4] Sadly no public incident report about the database incident. Only on the non-public ops list:
https://lists.wikimedia.org/mailman/private/ops/2014-November/044167.html
[5] http://mobile-reportcard.wmflabs.org/graphs/ui-daily
[6] https://phabricator.wikimedia.org/T1200