Hi,
I'm not sure if the outage from yesterday qualifies for an official
incident report. Let me know. Anyway, here is a report of what
happened, for future reference.
Timeline:
* 2018-03-06 12:58Z arturo starts doing package upgrades in toolforge
(jessie machines) with clush [0]. All operations are logged in SAL [1].
* 2018-03-06 13:21Z some upgrades fail because of a debconf prompt. The
prompt appeared because DEBIAN_FRONTEND=noninteractive was not set.
This resulted in stalled dpkg operations. Also, there were clashes with
puppet apt operations.
* 2018-03-06 13:21Z arturo kills the stalled dpkg processes in toolforge
and reconfigures the affected packages. There are 2 important affected
packages: libnss-ldap and sudo-ldap.
* 2018-03-06 13:32Z users report via IRC and phabricator that some tools
in toolforge are misbehaving [2][3][4]. We start investigating (arturo
and chico).
* 2018-03-06 13:38Z the first investigations point towards DB issues,
so the DBA team is contacted. They confirm all is working fine on their
side.
* 2018-03-06 14:07Z chase arrives on the scene and starts investigating.
NFS client issues are detected. Users can't read files in their home
directories in toolforge.
* 2018-03-06 14:23Z chasemp downtimes icinga alert for k8s workers
* 2018-03-06 15:13Z Andrew, Madhu, Bryan and Brook come to lend a
helping hand.
* 2018-03-06 15:21Z tracking task is created in phabricator [5]. By this
time, it is more than clear that the issue is related to the earlier
package upgrades.
* 2018-03-06 15:27Z some toolforge servers are rebooted. It is suggested
we start rebuilding part of the cluster.
* 2018-03-06 15:57Z Madhu reports that nscd restart + nscd cache flush +
machine reboot + puppet run can get servers back into a good state.
* 2018-03-06 16:21Z All systems are back to a normal state.
Notes:
* Package upgrade operations were carried out after several previous
tests on a list of canary servers [6].
* The need for DEBIAN_FRONTEND=noninteractive was already known, but a
human error occurred (arturo forgot to set it when doing the upgrades).
* Package pinning for nss/ldap/pam packages is in place, but it is not
enough. We need a *complete* freeze of these packages.
* This shows that our upgrade workflow [7] is not ready for wide usage
and needs more development.
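To illustrate what a "complete freeze" check could look like, here is a
minimal sketch in Python. It assumes holds are managed with apt-mark
(`apt-mark showhold` / `apt-mark hold`), which is only one possible
approach. The package list contains just the two packages named in this
report, and the function names are hypothetical, not part of any
existing tooling.

```python
# Hypothetical helper: find sensitive packages that are not on hold.
# The list holds only the two packages named in this report; the real
# set of nss/ldap/pam packages would need to be filled in.
SENSITIVE_PACKAGES = ["libnss-ldap", "sudo-ldap"]


def parse_held(showhold_output):
    """Parse the text printed by `apt-mark showhold` (one package per line).

    Any ':arch' qualifier is dropped so that 'pkg:amd64' matches 'pkg'.
    """
    held = set()
    for line in showhold_output.splitlines():
        name = line.strip()
        if name:
            held.add(name.split(":")[0])
    return held


def missing_holds(showhold_output, sensitive=SENSITIVE_PACKAGES):
    """Return the sensitive packages that are not currently held."""
    held = parse_held(showhold_output)
    return [pkg for pkg in sensitive if pkg not in held]


def hold_command(packages):
    """Build the apt-mark command that would put the packages on hold."""
    return ["apt-mark", "hold", *packages]
```

An upgrade script could refuse to start while missing_holds() returns a
non-empty list, or print the hold_command() for the operator to review.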
Conclusions:
* extend apt pinning to more nss/ldap/pam packages
* study implementing apt holds for those packages via puppet?
* embed DEBIAN_FRONTEND=noninteractive into the apt-upgrade script
* better integration of apt-upgrade with other apt operations
(especially puppet). Perhaps we could automatically disable puppet from
within the apt-upgrade script while it is running
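The last two points could be sketched roughly as follows. This is not
the actual apt-upgrade script, only a minimal Python illustration under
some assumptions: the function names and the `run` callable are
hypothetical, and the dpkg --force-confdef/--force-confold options are
an extra choice (keep existing config files) that was not discussed
above.

```python
import os
from contextlib import contextmanager


def noninteractive_env(base_env=None):
    """Return a copy of the environment with debconf prompts disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["DEBIAN_FRONTEND"] = "noninteractive"
    return env


def build_upgrade_command(packages):
    """Build an apt-get command that upgrades only the given packages."""
    return [
        "apt-get", "-y",
        # Assumption: keep locally modified config files instead of asking.
        "-o", "Dpkg::Options::=--force-confdef",
        "-o", "Dpkg::Options::=--force-confold",
        "install", "--only-upgrade", *packages,
    ]


@contextmanager
def puppet_disabled(run, reason):
    """Disable the puppet agent for the duration of the block."""
    run(["puppet", "agent", "--disable", reason], None)
    try:
        yield
    finally:
        # Always re-enable puppet, even if the upgrade fails.
        run(["puppet", "agent", "--enable"], None)


def upgrade(packages, run):
    """Upgrade packages non-interactively while puppet is disabled.

    `run(cmd, env)` is a hypothetical callable that executes a command,
    for example: lambda cmd, env: subprocess.run(cmd, env=env, check=True)
    """
    with puppet_disabled(run, "apt-upgrade in progress"):
        run(build_upgrade_command(packages), noninteractive_env())
```

Passing the runner in as a parameter keeps the sketch testable without
touching a real host. Note that, as far as I know, disabling the agent
only prevents new puppet runs, so a run already in progress would still
need to be waited for before starting apt.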
Data:
* Affected tools in toolforge: at least 3 [2][3][4]
* Amount of downtime: 3h (13:21Z --> 16:21Z)
Related links:
[0] toolforge: package upgrades as part of the new workflow
https://phabricator.wikimedia.org/T188994
[1]
https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[2] Global User Contributions complains about replica conf file
https://phabricator.wikimedia.org/T189001
[3] Connection Error at OrphanTalk Tools
https://phabricator.wikimedia.org/T188998
[4] Tool
https://tools.wmflabs.org/replag/ is reported via IRC
[5] Toolforge instances (maybe only Jessie?) are having issues with
NFS/LDAP
https://phabricator.wikimedia.org/T189018
[6]
https://etherpad.wikimedia.org/p/toolforge-upgrades
[7] create 'attended' upgrade workflow for cloud with Toolforge as
canonical case
https://phabricator.wikimedia.org/T181647