TLDR: Migrating two extensions to wfLoadExtension() caused problems that Logstash did not display.
== Timeline ==
Previous days
* In a massive effort by many people, lots of extensions were converted to extension.json, including:
** Timeline in https://gerrit.wikimedia.org/r/#/c/303248/
** ContactPage in https://gerrit.wikimedia.org/r/#/c/298084/
* These changes were not compatible with our current production configuration, so they had to be accompanied by mediawiki-config changes, and probably deployed separately, to minimize the chance of a screwup.
* Furthermore, even cursory testing of the above Timeline change would have shown that it was broken.
August 9
* After 12:00 SF time, Mukunda deploys the train to stage 0 wikis.
* At 16:00, Max prepares for SWAT but sees errors in fatalmonitor and investigates:
** Creating default object from empty value in /srv/mediawiki/wmf-config/CommonSettings.php on line 686
** Undefined variable: wgContactConfig in /srv/mediawiki/wmf-config/CommonSettings.php on line 968
* Max sees no such errors in Logstash.
* After identifying the cause, Max starts reverting the affected extensions. However, there were a lot of intermediate commits and Reedy was committing fixes, so Max proceeds with deploying the fixes instead.
* The fixes produce more problems. Max contemplates reverting group0 back to wmf.13 but decides not to because he has never done that before and fixes keep coming. In hindsight, this was a mistake.
* Config fixes to accommodate wmf.14 start causing notices in wmf.13, so Max resets wmf.13 Timeline to wmf.14.
* Errors indicating more breakages in Timeline prompt another batch of fixes.
* At 17:42, everything is back to normal.
== Casualties ==
* Max's liver.
* Evening SWAT didn't happen.
* For about 10 minutes, new timeline generation on production wikis was broken.
== Conclusions ==
* Our code review practices are lax: hairy patches get merged without testing, and self-merges happen.
* Timeline has 0 (zero) tests, while just a single parser test would have caught the problems during code review.
* The Logstash fatalmonitor dashboard isn't displaying HHVM warnings/errors right now.
* Scap uses Logstash to verify error levels, so while those errors are missing that check is useless.
* Logstash/Kibana is probably too complex a beast to be trusted as the definitive source of MediaWiki health information; fatalmonitor is still more reliable. Should we invest time in improving it and merging it with exceptionmonitor?
* During an ongoing outage with logs full of noise, testing on canary servers is hard because non-fatal errors are easy to miss on fluorine. Deployers need access to HHVM logs on all appservers.
* The Beta cluster isn't serving its purpose as the first line of defense against bugs (other than "oh, the whole thing is down"). Errors in beta should be watched as closely as in prod and treated with the same level of seriousness, because otherwise the former will eventually turn into the latter.
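
One way to read the Logstash conclusions above: a deploy check that trusts a single log source passes silently when that source is blind. The following is a minimal sketch of a check that cross-validates two sources and treats missing or contradictory data as a failure. It is not how scap actually works; the function name, parameters, and threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    ok: bool
    reason: str


def check_deploy_health(
    logstash_count: Optional[int],
    raw_log_count: Optional[int],
    baseline: int,
    max_ratio: float = 2.0,
) -> CheckResult:
    """Compare post-deploy error counts from two independent sources.

    All inputs are hypothetical: logstash_count would come from a Logstash
    query, raw_log_count from the raw HHVM logs, and baseline from the
    error count observed before the deploy.
    """
    # Missing data is a failure, never an implicit "zero errors".
    if logstash_count is None or raw_log_count is None:
        return CheckResult(False, "missing data from at least one log source")

    # One source silent while the other reports errors means one is blind.
    if (logstash_count == 0) != (raw_log_count == 0):
        return CheckResult(False, "log sources disagree: one of them is blind")

    # Judge by the worse of the two counts against the pre-deploy baseline.
    worst = max(logstash_count, raw_log_count)
    limit = max_ratio * max(baseline, 1)
    if worst > limit:
        return CheckResult(False, f"error count {worst} exceeds limit {limit:g}")
    return CheckResult(True, "error count within limit")
```

With inputs resembling this incident, for example `check_deploy_health(0, 40, 5)`, the check fails because the sources disagree, instead of passing on the strength of an empty Logstash result.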