On Tue, Jul 12, 2016 at 7:56 PM, Ori Livneh <ori@wikimedia.org> wrote:
On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier <greg@wikimedia.org> wrote:
<quote name="Greg Grossmeier" date="2016-07-12" time="09:24:38 -0700">
> https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"
>
> There was an order of magnitude increase in the rate of those errors
> that started on July 7th.
>
> Investigation and remediation is on-going.
Investigation and remediation is mostly complete[0] and the vast majority of cases have been addressed. There are still users who will experience this error for the next ~1 day.[1]
Is it actually fixed? It doesn't look like it, from the logs.
Since midnight UTC on July 7, 3,195 distinct users have tried and failed to log in a combined total of 25,047 times, or an average of approximately eight times per user. The six days that have passed since then were business as usual for Wikimedia Engineering.
Our failure to react to this swiftly and comprehensively is appalling and embarrassing. It represents a failure of process at multiple levels and a lack of accountability.
This (unbreak now) bug has been open since November. I wonder how this has been allowed to remain open and not addressed for this long?
A new user ran into this issue in June at an editathon that I attended. In his case, I could fix the problem by manually deleting the offending row in the database, but most of the time, the user likely gives up :(
I think we need to have a serious discussion about what happened, and think very hard about the changes we would need to make to our processes and organizational structure to prevent a recurrence.
I think we should also reach out to the users that were affected and apologize.
+1
Cheers, Katie
Ops mailing list
Ops@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops