Hi Anne,
You are right, it is a critical learning experience. And post-mortems are our
standard operational procedure (you may have seen some around security
issues).
I will let the team comment on when/where one is available. I'd like to
read it as well.
Lila
On Tue, Oct 27, 2015 at 1:02 PM, Risker <risker.wp(a)gmail.com> wrote:
The incident report does not go far enough back into the history of the
incident. It does not explain how this code managed to get into the
deployment chain with a fatal error in it. It does not identify ways to
prevent that from happening in the future.
Even the most conscientious and perfectionist developer will make the
occasional error - and the root problem here is not the error itself, but
the fact that anything that can take the entire Wikimedia cluster down for
9 minutes got deployed onto production wikis. Nine minutes of downtime on
one of the world's top-10 websites, caused by an *internal* error rather
than an external attack, is a very, very big deal, but I'm not getting that
impression from anything written here, on phabricator, or in the report
itself. That disappoints me far more than that an error was made in the
first place.
Risker/Anne
On 26 October 2015 at 23:04, MZMcBride <z(a)mzmcbride.com> wrote:
Greg Grossmeier wrote:
All is better now. Outage lasted about 10 minutes.
Full incident report will be written by Erik B today.
https://wikitech.wikimedia.org/wiki/Special:Permalink/197206
MZMcBride
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l