Wikipedia is down

List overview All Threads
Download

newer

older

[MediaWiki-announce] End of HTTP...

Tech Talk: Introduction to Free...

Strainu

26 Oct 2015 26 Oct '15

11:18 a.m.

Show replies by date

Amir E. Aharoni

26 Oct 26 Oct

11:21 a.m.

I had this a few minutes ago and now I get Error Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. בתאריך 26 באוק׳ 2015 17:20,‏ "Strainu" <strainu10(a)gmail.com> כתב:

...

All Wikimedia sites show me the same error. I'd log a bug if I could login: MediaWiki internal error. Exception caught inside exception handler. Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information. Regards, Strainu _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Steinsplitter Wiki

11:21 a.m.

Please see https://phabricator.wikimedia.org/T116593 :-)

...

Date: Mon, 26 Oct 2015 17:21:09 +0200 From: amir.aharoni(a)mail.huji.ac.il To: wikitech-l(a)lists.wikimedia.org Subject: Re: [Wikitech-l] Wikipedia is down I had this a few minutes ago and now I get Error Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. בתאריך 26 באוק׳ 2015 17:20,‏ "Strainu" <strainu10(a)gmail.com> כתב:

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Toby Negrin

11:57 a.m.

Working for me now (via mobile web) On Monday, October 26, 2015, Steinsplitter Wiki <steinsplitter-wiki(a)live.com> wrote:

...

Please see https://phabricator.wikimedia.org/T116593 :-)

Date: Mon, 26 Oct 2015 17:21:09 +0200 From: amir.aharoni(a)mail.huji.ac.il <javascript:;> To: wikitech-l(a)lists.wikimedia.org <javascript:;> Subject: Re: [Wikitech-l] Wikipedia is down I had this a few minutes ago and now I get Error Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. בתאריך 26 באוק׳ 2015 17:20,‏ "Strainu" <strainu10(a)gmail.com

<javascript:;>> כתב:

> All Wikimedia sites show me the same error. I'd log a bug if I could

MediaWiki internal error. Exception caught inside exception handler. Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information. Regards, Strainu _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org <javascript:;> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org <javascript:;> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Greg Grossmeier

12:08 p.m.

All is better now. Outage lasted about 10 minutes. Full incident report will be written by Erik B today. Greg <quote name="Toby Negrin" date="2015-10-26" time="08:57:07 -0700">

...

Working for me now (via mobile web) On Monday, October 26, 2015, Steinsplitter Wiki <steinsplitter-wiki(a)live.com> wrote:

Please see https://phabricator.wikimedia.org/T116593 :-)

<javascript:;>> כתב:

> All Wikimedia sites show me the same error. I'd log a bug if I could

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org <javascript:;> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |

MZMcBride

11:04 p.m.

Greg Grossmeier wrote:

...

All is better now. Outage lasted about 10 minutes. Full incident report will be written by Erik B today.

https://wikitech.wikimedia.org/wiki/Special:Permalink/197206 MZMcBride

Risker

27 Oct 27 Oct

8:02 a.m.

The incident report does not go far enough back into the history of the incident. It does not explain how this code managed to get into the deployment chain with a fatal error in it. It does not identify ways to prevent that from happening in the future. Even the most conscientious and perfectionist developer will make the occasional error - and the root problem here is not the error itself, but the fact that anything that can take the entire Wikimedia cluster down for 9 minutes got deployed onto production wikis. Nine minutes of downtime on one of the world's top-10 websites, caused by an *internal* error rather than an external attack, is a very, very big deal, but I'm not getting that impression from anything written here, on phabricator, or in the report itself. That disappoints me far more than that an error was made in the first place. Risker/Anne On 26 October 2015 at 23:04, MZMcBride <z(a)mzmcbride.com> wrote:

...

Greg Grossmeier wrote:

All is better now. Outage lasted about 10 minutes. Full incident report will be written by Erik B today.

https://wikitech.wikimedia.org/wiki/Special:Permalink/197206 MZMcBride _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Lila Tretikov

8:31 a.m.

Hi Anne, You are right, it a critical learning experience. And post-mortem are our standard operational procedures (you may have seen some around security issues). I will let the team comment on when/where one is available. I'd like to read it as well. Lila On Tue, Oct 27, 2015 at 1:02 PM, Risker <risker.wp(a)gmail.com> wrote:

...

Greg Grossmeier wrote:

All is better now. Outage lasted about 10 minutes. Full incident report will be written by Erik B today.

_______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Brad Jorsch (Anomie)

9:57 a.m.

On Tue, Oct 27, 2015 at 8:02 AM, Risker <risker.wp(a)gmail.com> wrote:

...

The incident report does not go far enough back into the history of the incident. It does not explain how this code managed to get into the deployment chain with a fatal error in it.

Actually, it does. Erik writes "This occured because the patch for the CirrusSearch repository that removed the schema should have been deployed before the change that adds it to the WikimediaEvents repository." In other words, there was nothing wrong with the code itself. The problem was that the multiple pieces of the change needed to be done in a particular order during the manual backporting process, but they were not done in that order. If this had waited for the train deployment, both pieces would have been done simultaneously and it wouldn't have been an issue, just as it wasn't an issue when these changes were done in master and automatically deployed to Beta Labs. -- Brad Jorsch (Anomie) Senior Software Engineer Wikimedia Foundation

Risker

10:29 a.m.

On 27 October 2015 at 09:57, Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org> wrote:

...

On Tue, Oct 27, 2015 at 8:02 AM, Risker <risker.wp(a)gmail.com> wrote:

The incident report does not go far enough back into the history of the incident. It does not explain how this code managed to get into the deployment chain with a fatal error in it.

That's a start, Brad. But even as someone who has limited experience with software deployment, I can think of at least half a dozen questions that I'd be asking here: - Why wasn't it part of the deployment train - As a higher level question, what are the thresholds for using a SWAT deployment as opposed to the regular deployment train, are these standards being followed, and are they the right standards. (Even I notice that most of the big problems seem to come with deployments outside of the deployment train.) - How was the code reviewed and tested before deployment - Why did it appear to work in some contexts (indicated in your response as master and Beta Labs) but not in the production context - How are we ensuring that deployments that require multiple sequential steps are (a) identified and (b) implemented in a way that those steps are followed in the correct order Notice how none of the questions are "what was wrong with the code" or "who screwed up". They're all systems questions. This is a systems problem. Even in situations where there *is* a problem with the code or someone *did* screw up, the root cause usually comes back to having single points of failure (e.g. one person having the ability to [unintentionally] get problem code deployed, or weaknesses in the code review and testing process). Risker/Anne

Brad Jorsch (Anomie)

10:56 a.m.

On Tue, Oct 27, 2015 at 10:29 AM, Risker <risker.wp(a)gmail.com> wrote:

...

- Why wasn't it part of the deployment train

Good question, and one that needs someone involved in this backport to answer.

...

- As a higher level question, what are the thresholds for using a SWAT deployment as opposed to the regular deployment train, are these standards being followed, and are they the right standards. (Even I notice that most of the big problems seem to come with deployments outside of the deployment train.)

My understanding is that SWAT is supposed to be for WMF configuration changes (i.e. the operations/mediawiki-config repo, which this wasn't) and for urgent bug fixes that can't wait for the weekly train. But my understanding might be too strict, so I'd recommend waiting for a more official answer than mine.

...

- How was the code reviewed and tested before deployment

First, it was reviewed before being merged into master. Then the SWAT deployer is supposed to review the backport for potential issues, although they may lack the domain-specific knowledge that the original reviewers have to spot issues like the one here.

...

- Why did it appear to work in some contexts (indicated in your response as master and Beta Labs) but not in the production context

You're assuming this code wouldn't have worked in the production context if deployed correctly. It's like asking "Why does it work to change a lightbulb normally, but it doesn't work if the bulb-changer forgets to remove the burned-out bulb before trying to put the new one in?"

...

- How are we ensuring that deployments that require multiple sequential steps are (a) identified and (b) implemented in a way that those steps are followed in the correct order

It requires that the people proposing/implementing the change identify the prerequisites required. There's currently no automated way to do this, and even if some automated mechanism such as "Depends-On" tags on the git commits were implemented it would require that people correctly use the mechanism and that the mechanism can be automatically tracked during backports as well as normal development merges. There's also the possibility that unit testing could catch such issues when the changes are merged to the deployment branches before being deployed, and our Release Engineering team has been working on increasing the number of extension unit tests run. But that requires we have unit tests that cover everything, which we don't so things can still slip through. It also wouldn't handle the case where the individual files of the change are individually deployed out of order, although at a glance it doesn't seem like that was the issue here. Taking this further to discuss plans, implementation, and mitigation of the remaining process issues is a discussion for the Release Engineering team, and may already be happening somewhere. Once people in SF get into work they might have further comments along these lines. -- Brad Jorsch (Anomie) Senior Software Engineer Wikimedia Foundation

Oliver Keyes

11:03 a.m.

On 27 October 2015 at 10:56, Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org> wrote:

...

On Tue, Oct 27, 2015 at 10:29 AM, Risker <risker.wp(a)gmail.com> wrote:

- Why wasn't it part of the deployment train

Good question, and one that needs someone involved in this backport to answer.

I can sort of answer this question (although don't consider it the canonical answer necessarily; this is just what I know from within the team and the analytics side of things): A few weeks ago changes to the infrastructure around EventLogging meant that our event collection around Search, for desktop events, unexpectedly broke. While it was non-functional, we were unable to collect any data for some of our high-level metrics, including load times and event counts around user searches. Not just "unable to visualise"; the events weren't being collected, meaning that backfilling was impossible - that data was simply gone and the longer we went without a fix, the longer a gap we'd have. Deploying the fix took a while for a couple of reasons - namely, the deployment freeze while Operations were out of town and the EventLogging rollback and freeze due to some issues with other changes to that extension. This extended the period we were missing data for, upping the criticality of fixing it. As of today we're at 1 month and 3 days of missing data. So, that's probably why it was SWATted; the irrevocable impact of waiting, and how long we had /been/ waiting.

...

- How was the code reviewed and tested before deployment

- Why did it appear to work in some contexts (indicated in your response as master and Beta Labs) but not in the production context

- How are we ensuring that deployments that require multiple sequential steps are (a) identified and (b) implemented in a way that those steps are followed in the correct order

-- Oliver Keyes Count Logula Wikimedia Foundation

Greg Grossmeier

11:32 a.m.

Hi all, Thanks, for the discussion. As you can imagine, it is high on my list to figure out why we've had 2 outages in the past couple weeks caused by config changes like this (more accurately, what we can do to prevent it). I think, after reading Brad's, Oliver's, and Erik's (partial, early release due to train) responses most of Risker's questions are answered. I'll just give a bit more from my perspective. <quote name="Brad Jorsch (Anomie)" date="2015-10-27" time="10:56:28 -0400">

...

On Tue, Oct 27, 2015 at 10:29 AM, Risker <risker.wp(a)gmail.com> wrote:

- Why wasn't it part of the deployment train

Good question, and one that needs someone involved in this backport to answer.

Erik and Oliver answered this.

...

Also answered by Erik, and https://wikitech.wikimedia.org/wiki/SWAT_deploys#Guidelines

...

- How was the code reviewed and tested before deployment

All code that is pushed out via a SWAT window has these review points: * Pre-commit review and testing done in Gerrit/Jenkins * Post-commit testing on Beta Cluster (which is automatically updated from master every 10 minutes) * The backport (what is deployed in the SWAT) is again reviewed/tested in Gerrit/Jenkins (will catch stupid errors at this point) * The SWAT deployer does their own review of the patch before committing and deploying.

...

- Why did it appear to work in some contexts (indicated in your response as master and Beta Labs) but not in the production context

This has been answered a few times already and is answered in your next question; the ordering was the issue.

...

- How are we ensuring that deployments that require multiple sequential steps are (a) identified and (b) implemented in a way that those steps are followed in the correct order

Really, this is why we have people do this work instead of machines: People know the (ever evolving) complex system that makes of production Wikimedia "servers" (already a mix of bare metal and virtualization, tons of interconnected services, etc) and thus people are the ones who can make the choices. We don't have the funds and person-hours to have a complete, 1:1, mirror of production as a test environment. It literally costs twice as much for the hardware and then another non-zero amount of people FTEs.

...

There's also the possibility that unit testing could catch such issues when the changes are merged to the deployment branches before being deployed, and our Release Engineering team has been working on increasing the number of extension unit tests run. But that requires we have unit tests that cover everything, which we don't so things can still slip through. It also wouldn't handle the case where the individual files of the change are individually deployed out of order, although at a glance it doesn't seem like that was the issue here.

Unit tests, which I think are related but not the whole solution here, are important to RelEng and I hope we (RelEng) can work with the rest of the engineering staff and community to improve our coverage ASAP. This isn't a task that "QA" or "RelEng" can do; it needs to be owned by all engineers. See, for example: https://integration.wikimedia.org/cover/mediawiki-core/master/php/ That's not a great place to be. And that is just MW Core.

...

Taking this further to discuss plans, implementation, and mitigation of the remaining process issues is a discussion for the Release Engineering team, and may already be happening somewhere. Once people in SF get into work they might have further comments along these lines.

This is mostly touching on scap3, which Erik described a little in his email. Scap3 (it'll just be 'scap', don't worry, scap3 is just the working name of the new feature set we're adding to it) will allow us to catch this kind of error much faster and before many people see it through the use of canary deploys with automated health checks (those health checks are configurable). We (RelEng) will talk with Erik more directly about why it was a 9 min outage vs 2 min soon and see what else we are missing from scap3 or the higher level process. Thanks all, Greg -- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |

Risker

12:37 p.m.

Good start in working this one out. On 27 October 2015 at 11:32, Greg Grossmeier <greg(a)wikimedia.org> wrote:

...

On Tue, Oct 27, 2015 at 10:29 AM, Risker <risker.wp(a)gmail.com> wrote:

- Why wasn't it part of the deployment train

Good question, and one that needs someone involved in this backport to answer.

Erik and Oliver answered this.

Actually their answer is "there was a major change to the infrastructure that broke our ability to collect data, thus escalating the need for this fix to the point that it needed to be done as a SWAT". So, as we keep drilling down, we see that the cause isn't even this "urgent fix", it's the previous deploy from some weeks ago that made the "urgent fix" necessary. And then we need to ask the same questions about that deploy - were the full range of tests done, are they the right tests, how long did it take to notice the problem, was any consideration given to rolling back that problematic deploy, and so on. And again these are systems issues. I agree that MediaWiki is a complex system. But one would think, given the significant focus on data collection, any acceptance testing before a significant infrastructure change would include a review to ensure that it will not impact data collection. (This is the kind of thing I mean when I say "are we doing the right tests".) We know that we've repeatedly seen implementation of publicly visible/publicly editable content creation extensions that don't work with Checkuser, aren't logged like content, aren't deletable or suppressible, can't be edited by others and so on - all of which should be standard requirements that should be tested for whenever such an extension is deployed to a production wiki in the Wikimedia cluster (and I'm pretty sure other MediaWiki users would consider those to be standard expectations too). But these aren't tested for because they aren't part of the standard testing suite. So yes...you're heading in the right direction, but don't think that by explaining why the urgent patch didn't work as expected, you've explained how we got to where we are. Risker/Anne

Brian Wolff

3:16 p.m.

...

And again these are systems issues. I agree that MediaWiki is a complex system. But one would think, given the significant focus on data collection, any acceptance testing before a significant infrastructure change would include a review to ensure that it will not impact data collection. (This is the kind of thing I mean when I say "are we doing the right tests".) We know that we've repeatedly seen implementation of publicly visible/publicly editable content creation extensions that don't work with Checkuser, aren't logged like content, aren't deletable or suppressible, can't be edited by others and so on - all of which should be standard requirements that should be tested for whenever such an extension is deployed to a production wiki in the Wikimedia cluster (and I'm pretty sure other MediaWiki users would consider those to be standard expectations too). But these aren't tested for because they aren't part of the standard testing suite.

[A little offtopic] I don't think lack of testing has anything to do with "We know that we've repeatedly seen implementation of publicly visible/publicly editable content creation extensions that don't work with Checkuser, aren't logged like content, aren't deletable or suppressible, can't be edited by others and so on". Testing doesn't help if you intentionally don't do something (To make a generalization. That might not even be true. But my impression is most of the time those things are "known issues" that people either decide aren't worth implementing, or are leaving to later). -- -bawolff

Erik Bernhardson

1:05 p.m.

Below i have finished out my thoughts that i was writing before the early-send. On Oct 27, 2015 7:29 AM, "Risker" <risker.wp(a)gmail.com> wrote:

...

On 27 October 2015 at 09:57, Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org> wrote: > On Tue, Oct 27, 2015 at 8:02 AM, Risker <risker.wp(a)gmail.com> wrote: > > > The incident report does not go far enough back into the history of

the

...

> > incident. It does not explain how this code managed to get into the > > deployment chain with a fatal error in it. > > > Actually, it does. Erik writes "This occured because the patch for the > CirrusSearch repository that removed the schema should have been

deployed

...

> before the change that adds it to the WikimediaEvents repository." > > In other words, there was nothing wrong with the code itself. The

problem

...

> was that the multiple pieces of the change needed to be done in a > particular order during the manual backporting process, but they were

not

...

> done in that order. > > If this had waited for the train deployment, both pieces would have been > done simultaneously and it wouldn't have been an issue, just as it

wasn't

...

> an issue when these changes were done in master and automatically

deployed

...

to Beta Labs.

This was a fix for something that broke during the previous deployment train. Specifically a hook was changed in core and not noticed in the extension until the events from javascript stopped coming into our logging tables.

...

- As a higher level question, what are the thresholds for using a SWAT

I'm on my phone so y This is

...

deployment as opposed to the regular deployment train, are these

standards

...

being followed, and are they the right standards. (Even I notice that

most

...

of the big problems seem to come with deployments outside of the

deployment

...

train.) - How was the code reviewed and tested before deployment

Code was reviewed and tested as normal, and that process worked as I would expect. What was missing was perhaps clear documentation on the order of patches. As an aside the way this is solved in other organizations, like google, is to have a single repository contain all the code. This has various other problems associated to it but it provides much stronger guarantees against patches being applied in the wrong order. Because of the nature of our project this type of solution is a non-starter, but perhaps somewhere between where we are and an omni repo would make sense.

...

- Why did it appear to work in some contexts (indicated in your

response

...

as master and Beta Labs) but not in the production context

Because, as stated in the report and by brad, the code itself works. The code was redeployed after the outage with no errors because the second time it was deployed in the correct order. This is why code review didn't catch the fatal and the error didn't show up in beta labs. This was an issue primarily with deployment process.

...

- How are we ensuring that deployments that require multiple sequential steps are (a) identified and (b) implemented in a way that those steps

are

...

followed in the correct order Notice how none of the questions are "what was wrong with the code" or

"who

...

screwed up". They're all systems questions. This is a systems problem. Even in situations where there *is* a problem with the code or someone *did* screw up, the root cause usually comes back to having single points of failure (e.g. one person having the ability to [unintentionally] get problem code deployed, or weaknesses in the code review and testing process). Risker/Anne _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

At a higher level, this was a 9 minute outage instead of a 2 or 3 minute outage due to two mistakes I made while doing the revert. Both of these are in the incident report. First the monitor I was watching from our log server to tell me it needs a rollback did not report this error adding a minute or two before the rollback started. Had these errors been included in the `fatalmonitor` the revert would have started the same minute that the code went out. We have other monitors that have been added in the past year that I should have been looking at as well. Second I reverted multiple patches from within gerrit (our code review tool) which takes too long when the site is down. I can only point to inexperience here. Others who have previously taken our sites down informed me that the proper way is to revert directly on the deployment server and follow up with changes in gerrit after the fire has been put out. I've been deploying patches at wmf for a couple years and have always in the past reverted through gerrit, but those didn't need the extra speedy recovery as the site was not down. All prior cases of my personal experience the problem deployment was only logging errors or some specific piece of functionality was not working. Going up another level comes to our deployment tooling specifically. RelEng is working on a project called scap3 which brings our deployment process closer to what you should expect from a top 10 website. It includes canary deployments (eg 1% of servers) along with a single command that undoes the entire deployment. Canary deployments allow to see an error before it is deployed everywhere, and a one command rollback operation would have likely brought the site back 3 to 4 minutes faster than how I reverted the patches. I did not link the scap3 portions as an actionable because, in my mind, that's not a single actionable thing. Scap3 is a major overhaul of our deploy process. Additionally this is already a priority in RelEng.

3140

days inactive

3141

days old

wikitech-l@lists.wikimedia.org

Manage subscription

15 comments

12 participants

tags (0)

participants (12)

Amir E. Aharoni
Brad Jorsch (Anomie)
Brian Wolff
Erik Bernhardson
Greg Grossmeier
Lila Tretikov
MZMcBride
Oliver Keyes
Risker
Steinsplitter Wiki
Strainu
Toby Negrin