[Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

Matthew Flaschen mflaschen at wikimedia.org
Wed Jul 13 01:25:01 UTC 2016


On 07/12/2016 07:56 PM, Ori Livneh wrote:
> Is it actually fixed? It doesn't look like it, from the logs.

It's beyond unhelpful that you would send this email without pointing to 
the logs you are referring to.  With a statement like that, a paste is 
called for.

If you mean the existing inconsistent state, there is a script running 
to address it, as Greg explicitly noted.

> It represents failure of process at multiple levels
> and a lack of accountability.

"Lack of accountability" is a serious charge, and one that I disagree 
with.  That would imply people did not take responsibility for their 
code's failures, or did not take this seriously, and that is not what I 
see.  The Collaboration team and other people, such as Bryan Davis, 
worked on this promptly as soon as they were made aware, and I take full 
responsibility for causing this issue.

The severity level may not have been evident until last night (thanks to 
Legoktm for helping show this).  Could the severity have been realized 
sooner?  Yes, but I'm not sure this is the way to make that happen.

> I think we need to have a serious discussion about what happened, and
> think very hard about the changes we would need to make to our processes
> and organizational structure to prevent a recurrence.

I am already writing an incident report, and I welcome a discussion.

However, I strongly disagree with the attitude that /there was a serious 
bug; therefore no one cared/.

I don't dispute it's a very serious and unfortunate bug, and I agree we 
should work to prevent bugs, and ensure they're remediated more promptly.

But I take my work and the extensions my team is responsible for 
seriously, and I worked on this urgently as soon as I knew about it.

Matt Flaschen


