Hi,
I'm not sure whether yesterday's outage qualifies for an official incident report. Let me know. In any case, here is a report of what happened, for future reference.
Timeline:
* 2018-03-06 12:58Z arturo starts doing package upgrades in toolforge (jessie machines) with clush [0]. All operations are logged in SAL [1].
* 2018-03-06 13:21Z some upgrades fail because of a debconf prompt. The prompt appeared because DEBIAN_FRONTEND=noninteractive was not set. This resulted in stalled dpkg operations. There were also clashes with puppet apt operations.
* 2018-03-06 13:21Z arturo kills the stalled dpkg procs in toolforge and reconfigures the affected packages. Two important packages are affected: libnss-ldap and sudo-ldap.
* 2018-03-06 13:32Z users report some tools in toolforge misbehaving, via IRC and phabricator [2][3][4]. We start investigating (arturo and chico).
* 2018-03-06 13:38Z the initial investigation points towards DB issues, so the DBA team is contacted. They confirm all is working fine on their side.
* 2018-03-06 14:07Z chase arrives on the scene and starts investigating. NFS client issues are detected. Users can't read files in their home directories in toolforge.
* 2018-03-06 14:23Z chasemp downtimes the icinga alert for k8s workers.
* 2018-03-06 15:13Z Andrew, Madhu, Bryan and Brook come to lend a helping hand.
* 2018-03-06 15:21Z a tracking task is created in phabricator [5]. By this time, it is more than clear that the issue is related to the earlier package upgrades.
* 2018-03-06 15:27Z some toolforge servers are rebooted. It is suggested that we start rebuilding part of the cluster.
* 2018-03-06 15:57Z Madhu reports that nscd restart + nscd cache flush + machine reboot + puppet run can get servers back into a good state.
* 2018-03-06 16:21Z all systems are back to a normal state.
Notes:
* The package upgrade operations were carried out after several earlier tests on a list of canary servers [6].
* The need for DEBIAN_FRONTEND=noninteractive was already known, but a human error occurred (arturo forgot to use it when doing the upgrades).
* Package pinning for nss/ldap/pam packages is in place, but it is not enough. We need a *complete* freeze of these packages.
* This shows that our upgrade workflow [7] is not ready for wide usage and needs more development.
Conclusions:
* extend apt pinning to more nss/ldap/pam packages
* study implementing apt holds for those packages via puppet?
* embed DEBIAN_FRONTEND=noninteractive into the apt-upgrade script
* better integration of apt-upgrade with other apt operations (especially puppet). Perhaps we could automatically disable puppet from within the apt-upgrade script while it is running
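To make the last three conclusions more concrete, here is a rough Python sketch of what such a wrapper could look like. It is not the actual apt-upgrade script: the function names, the list of held packages (only the two packages named in the timeline) and the choice of dpkg conffile options are all assumptions made for illustration. The idea is simply that the wrapper, not the operator, sets DEBIAN_FRONTEND=noninteractive, and that puppet is disabled for the duration of the run and always re-enabled afterwards.

```python
import os
import subprocess

# Assumption: only the two packages named in the report; a real list
# would likely cover more nss/ldap/pam packages.
HELD_PACKAGES = ("libnss-ldap", "sudo-ldap")


def noninteractive_env(base_env=None):
    """Return a copy of the environment with debconf prompts disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["DEBIAN_FRONTEND"] = "noninteractive"
    return env


def upgrade_command():
    """apt-get upgrade that keeps existing config files instead of asking."""
    return [
        "apt-get", "-y",
        "-o", "Dpkg::Options::=--force-confdef",
        "-o", "Dpkg::Options::=--force-confold",
        "upgrade",
    ]


def run_upgrade(held_packages=HELD_PACKAGES, runner=subprocess.run):
    """Disable puppet, hold sensitive packages, upgrade, re-enable puppet.

    `runner` is injectable so the sequence can be checked without
    touching a real system.
    """
    env = noninteractive_env()
    # Keep puppet from starting its own apt operations during the upgrade.
    runner(["puppet", "agent", "--disable", "apt-upgrade in progress"],
           env=env, check=True)
    try:
        if held_packages:
            # Held packages are skipped by `apt-get upgrade`.
            runner(["apt-mark", "hold", *held_packages], env=env, check=True)
        runner(upgrade_command(), env=env, check=True)
    finally:
        # Re-enable puppet even if the upgrade failed.
        runner(["puppet", "agent", "--enable"], env=env, check=True)
```

Calling `run_upgrade()` on a host would run the four commands in order. Whether holds should be set by this wrapper or managed by puppet, as suggested above, is left open here.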
Data:
* Affected tools in toolforge: at least 3 [2][3][4]
* Amount of downtime: 3h (13:21Z --> 16:21Z)
Related links:
[0] toolforge: package upgrades as part of the new workflow https://phabricator.wikimedia.org/T188994
[1] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[2] Global User Contributions complains about replica conf file https://phabricator.wikimedia.org/T189001
[3] Connection Error at OrphanTalk Tools https://phabricator.wikimedia.org/T188998
[4] Tool https://tools.wmflabs.org/replag/ is reported via IRC
[5] Toolforge instances (maybe only Jessie?) are having issues with NFS/LDAP https://phabricator.wikimedia.org/T189018
[6] https://etherpad.wikimedia.org/p/toolforge-upgrades
[7] create 'attended' upgrade workflow for cloud with Toolforge as canonical case https://phabricator.wikimedia.org/T181647