On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier <greg(a)wikimedia.org> wrote:
> <quote name="Greg Grossmeier" date="2016-07-12" time="09:24:38 -0700">
> > data for {Username}@{wiki}"
> > There was an order of magnitude increase in the rate of those errors
> > that started on July 7th.
> > Investigation and remediation is on-going.
>
> Investigation and remediation is mostly complete[0] and the vast
> majority of cases have been addressed. There are still users who will
> experience this error for the next ~1 day.[1]

Is it actually fixed? It doesn't look like it, from the logs.
Since midnight UTC on July 7, 3,195 distinct users have tried and failed to
log in a combined total of 25,047 times, or an average of approximately
eight times per user. The six days that have passed since then were
business as usual for Wikimedia Engineering.
Our failure to react to this swiftly and comprehensively is appalling and
embarrassing. It represents a failure of process at multiple levels and a
lack of accountability.
I think we need to have a serious discussion about what happened, and think
very hard about the changes we would need to make to our processes and
organizational structure to prevent a recurrence.
I think we should also reach out to the users that were affected and
apologize.