TLDR: migration of 2 extensions to wfLoadExtension() resulted in problems, Logstash wasn't displaying them.
== Timeline == Previous days * In a massive effort by many people, lots of extensions were converted to extension.json, including ** Timeline in https://gerrit.wikimedia.org/r/#/c/303248/ ** ContactPage in https://gerrit.wikimedia.org/r/#/c/298084/ * These changes were not compatible with our current production configuration and thus had to be accompanied with mediawiki-config changes and probably be deployed separately to minimize the chance of screwup. * Furthermore, even a cursory testing of the above Timeline change would have shown that it is broken.
August 9 * After 12:00 SF time Mukunda deploys train to stage 0 wikis * At 16:00 Max prepares for SWAT but sees errors in fatalmonitor and investigates: ** Creating default object from empty value in /srv/mediawiki/wmf-config/CommonSettings.php on line 686 ** Undefined variable: wgContactConfig in /srv/mediawiki/wmf-config/CommonSettings.php on line 968 * Max sees no such errors in Logstash. * After identifying the cause, Max starts reverting the affected extensions, however there were a lot of intermediate commits and Reedy was committing fixes so Max proceeds with deploying the fixes instead. * Fixes produced more problems. Max contemplates a revert of group0 back to wmf.13 but decides not to because he has never done that before and fixes kept on coming. In the hindsight, this was a mistake. * Config fixes to accommodate for wmf.14 started causing notices in wmf.13 so Max resets wmf.13 Timeline to wmf.14. * Errors indicating more breakages in Timeline prompt another batch of fixes. * At 17:42, everything is back to normal.
== Casualties ==
* Max's liver.
* Evening SWAT didn't happen.
* For about 10 minutes, new timeline generation on production wikis was broken.
== Conclusions ==
* Our code review practices are lax, including merging hairy patches without testing and self-merges.
* Timeline has 0 (zero) tests while just a single parser test would have allowed to detect problems during code review.
* Logstash fatalmonitor dashboard isn't displaying HHVM warnings/errors right now.
* And Logstash is used by scap to verify error levels, rendering this check useless.
* Logstash/Kibana is probably too complex a beast to be trusted to be the definitive source of MediaWiki health information, fatalmonitor is still more reliable. Invest time in improving it and merging with exceptionmonitor?
* In ongoing outage with logs full of noise, testing stuff on canary servers is hard as non-fatal errors are easy to miss on fluorine. Deployers need access to HHVM logs on all appservers.
* Beta cluster isn't serving its purpose of being the first line of defense against bugs (other than "oh, whole thing is down"). Errors in beta should be watched as closely as in prod and should be treated with the same level of seriousness, because otherwise the former will eventually turn into the latter.
On Wed, Aug 10, 2016 at 11:49 AM, Max Semenik maxsem.wiki@gmail.com wrote:
TLDR: migration of 2 extensions to wfLoadExtension() resulted in problems, Logstash wasn't displaying them.
Posted on Wikitech: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160809-MediaWik...
Thanks Max. Questions and comments below:
== Casualties ==
- Max's liver.
:(
== Conclusions ==
- Our code review practices are lax, including merging hairy patches
without testing and self-merges.
Is there a way to improve how code review is done such that appropriate testing would have happened in advance? I'm not referring to this specific case so much as the more general case of making sure that appropriate tests are written and done. Is there a "process owner" for code review who might take a look at this situation and see what can be learned from it for the general code review process?
Pine
On Wed, Aug 10, 2016 at 11:53 AM, Max Semenik maxsem.wiki@gmail.com wrote:
On Wed, Aug 10, 2016 at 11:49 AM, Max Semenik maxsem.wiki@gmail.com wrote:
TLDR: migration of 2 extensions to wfLoadExtension() resulted in
problems,
Logstash wasn't displaying them.
Posted on Wikitech: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160809- MediaWiki
-- Best regards, Max Semenik ([[User:MaxSem]]) _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
<quote name="Pine W" date="2016-08-11" time="22:25:15 -0700">
== Conclusions ==
- Our code review practices are lax, including merging hairy patches
without testing and self-merges.
Is there a way to improve how code review is done such that appropriate testing would have happened in advance? I'm not referring to this specific case so much as the more general case of making sure that appropriate tests are written and done. Is there a "process owner" for code review who might take a look at this situation and see what can be learned from it for the general code review process?
We haven't previously officially appointed a process owner for code review, but the Release Engineering Team and/or ArchCom are the closest things. The code review infrastructure is owned by RelEng, clearly, at least.
To answer the other question: Yes, we (RelEng, at least) are looking into improvement to the code-review process. There are a few related things happening:
* I working to get more explicit statements on what WMF teams consider the set of components for which they own the code review backlog and/or response to issues in production. See a first pass at: https://phabricator.wikimedia.org/T115854
* There has been continued effort by Developer Relations to improve the code review backlog 'issue'. See past work and potential future work as the subtasks of https://phabricator.wikimedia.org/T78768
There's more, but these are the highlights for now.
To more directly answer the question: Yes, we (RelEng) are looking into the situation to see what we can learn from it. I don't see one specific actionable coming out of this specific incident, but instead a group of related endeavors (some small, some large) where some are already in progress and others are still in the planning/formative stage as the way forward.
Best,
Greg
Ok, thanks Greg.
Pine
On Aug 12, 2016 08:49, "Greg Grossmeier" greg@wikimedia.org wrote:
<quote name="Pine W" date="2016-08-11" time="22:25:15 -0700"> > > == Conclusions == > > > > * Our code review practices are lax, including merging hairy patches > > without testing and self-merges. > > Is there a way to improve how code review is done such that appropriate > testing would have happened in advance? I'm not referring to this specific > case so much as the more general case of making sure that appropriate tests > are written and done. Is there a "process owner" for code review who might > take a look at this situation and see what can be learned from it for the > general code review process?
We haven't previously officially appointed a process owner for code review, but the Release Engineering Team and/or ArchCom are the closest things. The code review infrastructure is owned by RelEng, clearly, at least.
To answer the other question: Yes, we (RelEng, at least) are looking into improvement to the code-review process. There are a few related things happening:
I working to get more explicit statements on what WMF teams consider the set of components for which they own the code review backlog and/or response to issues in production. See a first pass at: https://phabricator.wikimedia.org/T115854
There has been continued effort by Developer Relations to improve the code review backlog 'issue'. See past work and potential future work as the subtasks of https://phabricator.wikimedia.org/T78768
There's more, but these are the highlights for now.
To more directly answer the question: Yes, we (RelEng) are looking into the situation to see what we can learn from it. I don't see one specific actionable coming out of this specific incident, but instead a group of related endeavors (some small, some large) where some are already in progress and others are still in the planning/formative stage as the way forward.
Best,
Greg
-- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
wikitech-l@lists.wikimedia.org