https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"
There was an order of magnitude increase in the rate of those errors that started on July 7th.
Investigation and remediation is on-going.
Greg
<quote name="Greg Grossmeier" date="2016-07-12" time="09:24:38 -0700">
https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"
There was an order of magnitude increase in the rate of those errors that started on July 7th.
Investigation and remediation is on-going.
Investigation and remediation is mostly complete[0] and the vast majority of cases have been addressed. There are still users who will experience this error for the next ~1 day.[1]
1.28.0-wmf.10 will be branched tomorrow and we will run an abbreviated train schedule (group0 and group1 on Wednesday, group2 on Thursday).
Thanks to Matt Flaschen and Brad Jorsch (and others like Ori Livneh and Bryan Davis) for their help.
Sorry for the inconvenience.
Greg
[0] Modulo https://phabricator.wikimedia.org/T140156 which shouldn't affect auto-creation.
[1] Users who were already affected before the code fixes were deployed will still have the issue until the script that repairs those cases finishes running, which takes roughly 1 day. A run is in progress now, but we will run it again because we deployed fixes mid-run.
On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier greg@wikimedia.org wrote:
<quote name="Greg Grossmeier" date="2016-07-12" time="09:24:38 -0700">
> https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"
>
> There was an order of magnitude increase in the rate of those errors that started on July 7th.
>
> Investigation and remediation is on-going.
Investigation and remediation is mostly complete[0] and the vast majority of cases have been addressed. There are still users who will experience this error for the next ~1 day.[1]
Is it actually fixed? It doesn't look like it, from the logs.
Since midnight UTC on July 7, 3,195 distinct users have tried and failed to log in a combined total of 25,047 times, or an average of approximately eight times per user. The six days that have passed since then were business as usual for Wikimedia Engineering.
Our failure to react to this swiftly and comprehensively is appalling and embarrassing. It represents failure of process at multiple levels and a lack of accountability.
I think we need to have a serious discussion about what happened, and think very hard about the changes we would need to make to our processes and organizational structure to prevent a recurrence.
I think we should also reach out to the users that were affected and apologize.
<quote name="Ori Livneh" date="2016-07-12" time="16:56:11 -0700">
On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier greg@wikimedia.org wrote:
<quote name="Greg Grossmeier" date="2016-07-12" time="09:24:38 -0700">
> https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"
>
> There was an order of magnitude increase in the rate of those errors that started on July 7th.
>
> Investigation and remediation is on-going.
Investigation and remediation is mostly complete[0] and the vast majority of cases have been addressed. There are still users who will experience this error for the next ~1 day.[1]
Is it actually fixed? It doesn't look like it, from the logs.
That was the information I was given. If it is not improved after the fixes and letting the maint script finish then we'll know more certainly, and with that certainty can modify our plans (as we always do).
Our failure to react to this swiftly and comprehensively is appalling and embarrassing. It represents failure of process at multiple levels and a lack of accountability.
Matt is working on an incident report for this.
I think we should also reach out to the users that were affected and apologize.
That certainly should/could be one of the action items.
Greg
On Tue, Jul 12, 2016 at 7:56 PM, Ori Livneh ori@wikimedia.org wrote:
On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier greg@wikimedia.org wrote:
<quote name="Greg Grossmeier" date="2016-07-12" time="09:24:38 -0700">
> https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"
>
> There was an order of magnitude increase in the rate of those errors that started on July 7th.
>
> Investigation and remediation is on-going.
Investigation and remediation is mostly complete[0] and the vast majority of cases have been addressed. There are still users who will experience this error for the next ~1 day.[1]
Is it actually fixed? It doesn't look like it, from the logs.
Since midnight UTC on July 7, 3,195 distinct users have tried and failed to log in a combined total of 25,047 times, or an average of approximately eight times per user. The six days that have passed since then were business as usual for Wikimedia Engineering.
Our failure to react to this swiftly and comprehensively is appalling and embarrassing. It represents failure of process at multiple levels and a lack of accountability.
This (unbreak now) bug has been open since November. I wonder how this has been allowed to remain open and not addressed for this long?
A new user ran into this issue in June at an editathon that I attended. In his case, I could fix the problem by manually deleting the offending row in the database, but most of the time, the user likely gives up :(
I think we need to have a serious discussion about what happened, and think very hard about the changes we would need to make to our processes and organizational structure to prevent a recurrence.
I think we should also reach out to the users that were affected and apologize.
+1
Cheers, Katie
Ops mailing list Ops@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/ops
On 07/12/2016 08:15 PM, aude wrote:
This (unbreak now) bug has been open since November. I wonder how this has been allowed to remain open and not addressed for this long?
This has not all been caused by Echo, and it really isn't one bug, just one symptom.
There are clearly multiple causes. The Echo one has been addressed, and there are multiple fixes and mitigations on the CentralAuth/core auth side, some merged (e.g. https://gerrit.wikimedia.org/r/#/c/298531/ , https://gerrit.wikimedia.org/r/#/c/298416/ ), some still being worked on/discussed ( https://gerrit.wikimedia.org/r/#/c/297946/ , https://gerrit.wikimedia.org/r/#/c/297936/ ), but work is not done.
Matt
On Wed, Jul 13, 2016 at 2:15 AM, aude aude.wiki@gmail.com wrote:
On Tue, Jul 12, 2016 at 7:56 PM, Ori Livneh ori@wikimedia.org wrote:
Our failure to react to this swiftly and comprehensively is appalling and embarrassing. It represents failure of process at multiple levels and a lack of accountability.
This (unbreak now) bug has been open since November. I wonder how this has been allowed to remain open and not addressed for this long?
I am sure we could've done way better even in our current structure, but it's pretty clear to me that the absence of a team dedicated to MediaWiki itself invites such things to happen.
Which is pretty absurd, when you remember that 99% of our traffic is still served by it.
Cheers
G.
On Tue, 2016-07-12 at 20:15 -0400, aude wrote:
This (unbreak now) bug has been open since November. I wonder how this has been allowed to remain open and not addressed for this long?
FYI, Matt created a task about "Unbreak now" priority, to receive input from Team-Practices: https://phabricator.wikimedia.org/T140207
andre
On 07/12/2016 07:56 PM, Ori Livneh wrote:
Is it actually fixed? It doesn't look like it, from the logs.
It's beyond unhelpful that you would send this email without pointing to the logs you are referring to. With a statement like that, a paste is called for.
If you mean the inconsistent state that already exists, there is a script running as Greg explicitly noted.
It represents failure of process at multiple levels and a lack of accountability.
"Lack of accountability" is a serious charge, and one that I disagree with. That would imply people did not take responsibility for their code's failures, or did not take this seriously, and that is not what I see. The Collaboration team and other people, such as Bryan Davis, worked on this promptly as soon as they were made aware, and I take full responsibility for causing this issue.
The severity level may not have been evident until last night (thanks to Legoktm for helping show this). Could the severity have been realized sooner? Yes, but I'm not sure this is the way to make that happen.
I think we need to have a serious discussion about what happened, and think very hard about the changes we would need to make to our processes and organizational structure to prevent a recurrence.
I am already writing an incident report, and I welcome a discussion.
However, I strongly disagree with the attitude that /there was a serious bug; therefore no one cared/ .
I don't dispute it's a very serious and unfortunate bug, and I agree we should work to prevent bugs, and ensure they're remediated more promptly.
But I take my work and the extensions my team is responsible for seriously, and I worked on this urgently as soon as I knew about it.
Matt Flaschen
On 07/12/2016 09:25 PM, Matthew Flaschen wrote:
I am already writing an incident report, and I welcome a discussion.
Incident report for the Echo part of this: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCent... . Please edit and improve.
Thanks,
Matt
I want to apologize for my response to Ori on this thread. I shouldn't have responded like that, and I'm sorry.
Matt
On 07/12/2016 09:25 PM, Matthew Flaschen wrote:
On 07/12/2016 07:56 PM, Ori Livneh wrote:
[...]
Hi,
On 07/12/2016 04:56 PM, Ori Livneh wrote:
Is it actually fixed? It doesn't look like it, from the logs.
Since midnight UTC on July 7, 3,195 distinct users have tried and failed to log in a combined total of 25,047 times, or an average of approximately eight times per user. The six days that have passed since then were business as usual for Wikimedia Engineering.
We should not be blocking login anymore. The patch[1] I deployed last night catches the exceptions so users are able to log in, but still continues to log them. I'm not sure if there's a way to tell the difference between an exception that was shown to a user and one that was just logged.
[1] https://gerrit.wikimedia.org/r/#/c/298416/
-- Legoktm
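A minimal Python sketch of the catch-and-log behaviour described above, for readers who want to see the pattern concretely. It is not the deployed patch (MediaWiki is written in PHP); every name here (`MissingLocalUserData`, `attach_local_account`, `login`, the `shown_to_user` field) is hypothetical. The `shown_to_user` flag illustrates one possible way to tell suppressed exceptions apart from user-facing ones in the logs; that is an assumption, not something the patch is known to do.

```python
import logging

logger = logging.getLogger("login")


class MissingLocalUserData(Exception):
    """Hypothetical stand-in for the 'Could not find local user data' error."""


def attach_local_account(username, wiki, local_rows):
    # Hypothetical lookup: raises when the local user row is missing.
    if (username, wiki) not in local_rows:
        raise MissingLocalUserData(
            f"Could not find local user data for {username}@{wiki}"
        )
    return local_rows[(username, wiki)]


def login(username, wikis, local_rows):
    """Log the user in, tolerating missing rows on individual wikis."""
    attached = []
    for wiki in wikis:
        try:
            attached.append(attach_local_account(username, wiki, local_rows))
        except MissingLocalUserData as exc:
            # Record the error but do not block login. The extra field marks
            # the exception as suppressed rather than shown to the user.
            logger.warning("%s", exc, extra={"shown_to_user": False})
    return attached
```

With rows for only one wiki, `login("Alice", ["enwiki", "randomwiki"], rows)` returns the enwiki row and logs one warning for randomwiki. The sketch covers only the case of a broken account on another wiki, not a login to the wiki whose row is missing.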
On 07/12/2016 11:35 PM, Legoktm wrote:
We should not be blocking login anymore. The patch[1] I deployed last night catches the exceptions so users are able to log in, but still continues to log them.
Does that still apply if they're logging in *to* the wiki where their user row is missing?
I know it fixes the issue "I can't log into English Wikipedia because my account on randomwiki is messed up".
Matt
I think we need to have a serious discussion about what happened, and think very hard about the changes we would need to make to our processes and organizational structure to prevent a recurrence.
Hi all, I'm going to schedule some time next week to discuss the incident and its response. Good writeup https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth, by the way, Matt.
I think we should also reach out to the users that were affected and apologize.
I agree. Can someone please privately provide me a list of affected users so we can work with a community liaison and engineer to communicate out a "sorry" message?
-Adam
Hi all, I'm going to schedule some time next week to discuss the incident and its response. Good writeup https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth, by the way, Matt.
Notes posted:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCent...
On 07/12/2016 07:07 PM, Greg Grossmeier wrote:
Thanks to Matt Flaschen and Brad Jorsch (and others like Ori Livneh and Bryan Davis) for their help.
Also Roan Kattouw, Kunal Mehta, and Stephane Bisson.
Matt
wikitech-l@lists.wikimedia.org