New subject: Incident report: Multiple breakages due to wfLoadExtension() transition

10 Aug 2016


TL;DR: migrating two extensions to wfLoadExtension() caused breakages, and
Logstash wasn't displaying the resulting errors.
== Timeline ==
Previous days
* In a massive effort by many people, lots of extensions were converted to
extension.json, including
** Timeline in https://gerrit.wikimedia.org/r/#/c/303248/
** ContactPage in https://gerrit.wikimedia.org/r/#/c/298084/
* These changes were not compatible with our current production
configuration, so they had to be accompanied by mediawiki-config changes
and should probably have been deployed separately to minimize the chance of
a screwup.
* Furthermore, even cursory testing of the above Timeline change would have
shown that it was broken.
August 9
* After 12:00 SF time, Mukunda deploys the train to stage 0 wikis.
* At 16:00, Max prepares for SWAT but sees errors in fatalmonitor and
investigates:
** Creating default object from empty value in
/srv/mediawiki/wmf-config/CommonSettings.php on line 686
** Undefined variable: wgContactConfig in
/srv/mediawiki/wmf-config/CommonSettings.php on line 968
* Max sees no such errors in Logstash.
* After identifying the cause, Max starts reverting the affected
extensions. However, there are a lot of intermediate commits and Reedy is
already committing fixes, so Max proceeds with deploying the fixes instead.
* The fixes produce more problems. Max contemplates reverting group0 back
to wmf.13 but decides against it because he has never done that before and
fixes keep coming. In hindsight, this was a mistake.
* Config fixes made to accommodate wmf.14 start causing notices in wmf.13,
so Max resets Timeline in wmf.13 to the wmf.14 version.
* Errors indicating more breakages in Timeline prompt another batch of
fixes.
* At 17:42, everything is back to normal.
== Casualties ==
* Max's liver.
* Evening SWAT didn't happen.
* For about 10 minutes, new timeline generation on production wikis was
broken.
== Conclusions ==
* Our code review practices are lax, including merging hairy patches
without testing and self-merges.
* Timeline has 0 (zero) tests, while just a single parser test would have
caught the problems during code review.
* Logstash fatalmonitor dashboard isn't displaying HHVM warnings/errors
right now.
* Scap relies on Logstash to verify error levels, so with these errors
missing from Logstash that check is useless.
* Logstash/Kibana is probably too complex a beast to be trusted as the
definitive source of MediaWiki health information; fatalmonitor is still
more reliable. Should we invest time in improving it and merging it with
exceptionmonitor?
* During an ongoing outage, with logs full of noise, testing stuff on
canary servers is hard because non-fatal errors are easy to miss on
fluorine. Deployers need access to HHVM logs on all appservers.
* The beta cluster isn't serving its purpose of being the first line of
defense against bugs (other than "oh, whole thing is down"). Errors in beta
should be watched as closely as those in prod and treated just as
seriously, because otherwise the former will eventually turn into the
latter.
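
To illustrate the point about non-fatal errors getting lost in noisy logs,
here is a minimal Python sketch that groups warning lines by message and
location so that a new one stands out. This is not how fatalmonitor or
Logstash work. The "Warning:"/"Notice:" prefixes and the one-error-per-line
layout are assumptions for illustration (only the message texts and paths
come from the errors quoted in the timeline), and summarize() is a
hypothetical helper.

```python
import re
from collections import Counter

# Assumed line layout: "<Level>: <message> in <path> on line <N>".
# The level prefix is hypothetical; real HHVM log lines may differ.
LINE_RE = re.compile(
    r"(?P<level>Warning|Notice|Fatal error): "
    r"(?P<message>.+?) in (?P<path>\S+) on line (?P<line>\d+)"
)


def summarize(lines):
    """Count occurrences of each (level, message, path, line) tuple."""
    counts = Counter()
    for raw in lines:
        match = LINE_RE.search(raw)
        if match is None:
            continue  # skip anything that does not look like an error line
        key = (
            match["level"],
            match["message"],
            match["path"],
            int(match["line"]),
        )
        counts[key] += 1
    # Most frequent first, so the dominant breakage is at the top.
    return counts.most_common()


# Sample input built from the two messages seen during the incident.
sample = [
    "Warning: Creating default object from empty value in "
    "/srv/mediawiki/wmf-config/CommonSettings.php on line 686",
    "Notice: Undefined variable: wgContactConfig in "
    "/srv/mediawiki/wmf-config/CommonSettings.php on line 968",
    "Warning: Creating default object from empty value in "
    "/srv/mediawiki/wmf-config/CommonSettings.php on line 686",
    "some unrelated log noise",
]

for (level, message, path, line), count in summarize(sample):
    print(f"{count:4d}  {level}: {message} ({path}:{line})")
```

Grouping by message and location rather than by raw line keeps a burst of
identical warnings from hiding a different, newly introduced one.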
-- 
Best regards,
Max Semenik ([[User:MaxSem]])