All wikis reverted to wmf.8 last night due to T119736

Greg Grossmeier

12 Jul 2016

8:24 p.m.

https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"

There was an order of magnitude increase in the rate of those errors that started on July 7th.

Investigation and remediation is on-going.

Greg

-- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |

Greg Grossmeier

13 Jul

3:07 a.m.

New subject: The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

...

https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"

There was an order of magnitude increase in the rate of those errors that started on July 7th.

Investigation and remediation is on-going.

Investigation and remediation is mostly complete[0] and the vast majority of cases have been addressed. There are still users who will experience this error for the next ~1 day.[1]

1.28.0-wmf.10 will be branched tomorrow and we will run an abbreviated train schedule (group0 and group1 on Wednesday, group2 on Thursday).

Thanks to Matt Flaschen and Brad Jorsch (and others like Ori Livneh and Bryan Davis) for their help.

Sorry for the inconvenience.

Greg

[0] Modulo https://phabricator.wikimedia.org/T140156 which shouldn't affect auto-creation.

[1] Users who were affected by this already before the fixes in the code were deployed will still have the issue until the script that fixes those cases completes running, which takes roughly 1 day. There is a run of it going now, but we will run it again as we deployed fixes mid-run.

-- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |

Ori Livneh

3:56 a.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier greg@wikimedia.org wrote:

...

<quote name="Greg Grossmeier" date="2016-07-12" time="09:24:38 -0700">
> https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"
>
> There was an order of magnitude increase in the rate of those errors that started on July 7th.
>
> Investigation and remediation is on-going.

Investigation and remediation is mostly complete[0] and the vast majority of cases have been addressed. There are still users who will experience this error for the next ~1 day.[1]

Is it actually fixed? It doesn't look like it, from the logs.

Since midnight UTC on July 7, 3,195 distinct users have tried and failed to log in a combined total of 25,047 times, or an average of approximately eight times per user. The six days that have passed since then were business as usual for the Wikimedia Engineering.

Our failure to react to this swiftly and comprehensively is appalling and embarrassing. It represents failure of process at multiple levels and a lack of accountability.

I think we need to have a serious discussion about what happened, and think very hard about the changes we would need to make to our processes and organizational structure to prevent a recurrence.

I think we should also reach out to the users that were affected and apologize.
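
The average quoted in the message above can be checked from the two figures it gives (3,195 distinct users, 25,047 failed attempts). This is only a minimal arithmetic sketch; the variable names are illustrative and nothing here is taken from the actual logs.

```python
# Figures quoted in the message above (since midnight UTC on July 7).
distinct_users = 3195
failed_logins = 25047

# Mean number of failed login attempts per affected user.
average = failed_logins / distinct_users

print(f"{average:.2f} failed attempts per user")
```

The result is a little under eight, which matches the "approximately eight times per user" in the message.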

Greg Grossmeier

4:13 a.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

...

On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier greg@wikimedia.org wrote:

...
<quote name="Greg Grossmeier" date="2016-07-12" time="09:24:38 -0700">
> https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"
>
> There was an order of magnitude increase in the rate of those errors that started on July 7th.
>
> Investigation and remediation is on-going.

Investigation and remediation is mostly complete[0] and the vast majority of cases have been addressed. There are still users who will experience this error for the next ~1 day.[1]

Is it actually fixed? It doesn't look like it, from the logs.

That was the information I was given. If it is not improved after the fixes and letting the maint script finish then we'll know more certainly, and with that certainty can modify our plans (as we always do).

...

Our failure to react to this swiftly and comprehensively is appalling and embarrassing. It represents failure of process at multiple levels and a lack of accountability.

Matt is working on an incident report for this.

...

I think we should also reach out to the users that were affected and apologize.

That certainly should/could be one of the action items.

Greg

-- | Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E | | identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |

aude

4:15 a.m.

New subject: [Ops] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

On Tue, Jul 12, 2016 at 7:56 PM, Ori Livneh ori@wikimedia.org wrote:

...

On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier greg@wikimedia.org wrote:

...
<quote name="Greg Grossmeier" date="2016-07-12" time="09:24:38 -0700">
> https://phabricator.wikimedia.org/T119736 - "Could not find local user data for {Username}@{wiki}"
>
> There was an order of magnitude increase in the rate of those errors that started on July 7th.
>
> Investigation and remediation is on-going.

Investigation and remediation is mostly complete[0] and the vast majority of cases have been addressed. There are still users who will experience this error for the next ~1 day.[1]

Is it actually fixed? It doesn't look like it, from the logs.

Since midnight UTC on July 7, 3,195 distinct users have tried and failed to log in a combined total of 25,047 times, or an average of approximately eight times per user. The six days that have passed since then were business as usual for the Wikimedia Engineering.

Our failure to react to this swiftly and comprehensively is appalling and embarrassing. It represents failure of process at multiple levels and a lack of accountability.

This (unbreak now) bug has been open since November. I wonder how this has been allowed to remain open and not addressed for this long?

A new user ran into this issue in June at an editathon that I attended. In his case, I could fix the problem by manually deleting the offending row in the database, but most of the time, the user likely gives up :(

...

I think we need to have a serious discussion about what happened, and think very hard about the changes we would need to make to our processes and organizational structure to prevent a recurrence.

I think we should also reach out to the users that were affected and apologize.

Cheers, Katie

...

-- @wikidata

Matthew Flaschen

5:13 a.m.

New subject: [Ops] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

On 07/12/2016 08:15 PM, aude wrote:

...

This (unbreak now) bug has been open since November. I wonder how this has been allowed to remain open and not addressed for this long?

This has not all been caused by Echo, and it really isn't one bug, just one symptom.

There are clearly multiple causes. The Echo one has been addressed, and there are multiple fixes and mitigation on the CentralAuth/core auth side, some merged (e.g. https://gerrit.wikimedia.org/r/#/c/298531/ , https://gerrit.wikimedia.org/r/#/c/298416/ ), some still being worked on/discussed ( https://gerrit.wikimedia.org/r/#/c/297946/ , https://gerrit.wikimedia.org/r/#/c/297936/ ), but work is not done.

Matt

Giuseppe Lavagetto

9:48 a.m.

New subject: [Ops] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

On Wed, Jul 13, 2016 at 2:15 AM, aude aude.wiki@gmail.com wrote:

...

On Tue, Jul 12, 2016 at 7:56 PM, Ori Livneh ori@wikimedia.org wrote:

...
Our failure to react to this swiftly and comprehensively is appalling and embarrassing. It represents failure of process at multiple levels and a lack of accountability.

This (unbreak now) bug has been open since November. I wonder how this has been allowed to remain open and not addressed for this long?

I am sure we could've done way better even in our current structure, but it's pretty clear to me that the absence of a team dedicated to MediaWiki itself calls for such things to happen.

Which is pretty absurd, when you remember that 99% of our traffic is still served by it.

Cheers

-- Giuseppe Lavagetto, Ph.d. Senior Technical Operations Engineer, Wikimedia Foundation

Andre Klapper

5:44 p.m.

New subject: [Ops] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

On Tue, 2016-07-12 at 20:15 -0400, aude wrote:

...

This (unbreak now) bug has been open since November. I wonder how this has been allowed to remain open and not addressed for this long?

FYI, Matt created a task about "Unbreak now" priority, to receive input from Team-Practices: https://phabricator.wikimedia.org/T140207

andre

-- Andre Klapper | Wikimedia Bugwrangler http://blogs.gnome.org/aklapper/

Matthew Flaschen

5:25 a.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

On 07/12/2016 07:56 PM, Ori Livneh wrote:

...

Is it actually fixed? It doesn't look like it, from the logs.

It's beyond unhelpful that you would send this email without pointing to the logs you are referring to. With a statement like that, a paste is called for.

If you mean the existing inconsistent state that already exists, there is a script running as Greg explicitly noted.

...

It represents failure of process at multiple levels and a lack of accountability.

"Lack of accountability" is a serious charge, and one that I disagree with. That would imply people did not take responsibility for their code's failures, or did not take this seriously, and that is not what I see. The Collaboration team and other people, such as Bryan Davis, worked on this promptly as soon as they were made aware, and I take full responsibility for causing this issue.

The severity level may not have been evident until last night (thanks to Legoktm for helping show this). Could the severity have been realized sooner? Yes, but I'm not sure this is the way to make that happen.

...

I think we need to have a serious discussion about what happened, and think very hard about the changes we would need to make to our processes and organizational structure to prevent a recurrence.

I am already writing an incident report, and I welcome a discussion.

However, I strongly disagree with the attitude that /there was a serious bug; therefore no one cared/ .

I don't dispute it's a very serious and unfortunate bug, and I agree we should work to prevent bugs, and ensure they're remediated more promptly.

But I take my work and the extensions my team is responsible for seriously, and I worked on this urgently as soon as I knew about it.

Matt Flaschen

Matthew Flaschen

7:45 a.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

On 07/12/2016 09:25 PM, Matthew Flaschen wrote:

...

I am already writing an incident report, and I welcome a discussion.

Incident report for the Echo part of this: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCent... . Please edit and improve.

Thanks,

Matt

Matthew Flaschen

21 Jul

7:51 p.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

I want to apologize for my response to Ori on this thread. I shouldn't have responded like that, and I'm sorry.

Matt

On 07/12/2016 09:25 PM, Matthew Flaschen wrote:

...

On 07/12/2016 07:56 PM, Ori Livneh wrote:

[...]

Legoktm

13 Jul

7:35 a.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

Hi,

On 07/12/2016 04:56 PM, Ori Livneh wrote:

...

Is it actually fixed? It doesn't look like it, from the logs.

Since midnight UTC on July 7, 3,195 distinct users have tried and failed to log in a combined total of 25,047 times, or an average of approximately eight times per user. The six days that have passed since then were business as usual for the Wikimedia Engineering.

We should not be blocking login anymore. The patch[1] I deployed last night catches the exceptions so users are able to login, but still continues to log them. I'm not sure if there's a way to tell the difference between an exception that was shown to a user and one that was just logged.

[1] https://gerrit.wikimedia.org/r/#/c/298416/

-- Legoktm
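
The behaviour described above (catch the exception so login can proceed, but keep logging it) and the open question about telling shown and suppressed exceptions apart can be illustrated with a small sketch. This is not the code from the Gerrit change, which lives in the MediaWiki/CentralAuth codebase; every name below (`LocalUserDataMissing`, `load_local_user`, `login`, the `shown_to_user` field, the `auth-sketch` logger) is hypothetical and chosen only for illustration. One possible way to distinguish the two cases, assuming structured log records, is to attach a flag to each record.

```python
import logging

logger = logging.getLogger("auth-sketch")  # hypothetical logger name


class LocalUserDataMissing(Exception):
    """Stand-in for the 'Could not find local user data' error."""


def load_local_user(username, wiki, rows):
    # Hypothetical lookup: `rows` maps (username, wiki) to local user data.
    try:
        return rows[(username, wiki)]
    except KeyError:
        raise LocalUserDataMissing(
            f"Could not find local user data for {username}@{wiki}"
        ) from None


def login(username, wikis, rows):
    """Collect local user data, tolerating a missing row on any one wiki."""
    attached = []
    for wiki in wikis:
        try:
            attached.append(load_local_user(username, wiki, rows))
        except LocalUserDataMissing as exc:
            # Caught instead of propagated: the login continues, and the
            # record carries a flag saying the user never saw this error.
            logger.warning("%s", exc, extra={"shown_to_user": False})
    return attached
```

With a flag like this on every record, a log query could count only the entries where the error actually reached a user. Whether the production logging pipeline can carry such a field is an assumption, not something stated in the thread.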

Matthew Flaschen

14 Jul

2:14 a.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

On 07/12/2016 11:35 PM, Legoktm wrote:

...

We should not be blocking login anymore. The patch[1] I deployed last night catches the exceptions so users are able to login, but still continues to log them.

Does that still apply if they're logging in *to* the wiki where their user row is missing?

I know it fixes the issue "I can't log into English Wikipedia because my account on randomwiki is messed up".

Matt

Adam Baso

13 Jul

7:19 p.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

...

I think we need to have a serious discussion about what happened, and think very hard about the changes we would need to make to our processes and organizational structure to prevent a recurrence.

Hi all, I'm going to schedule some time next week to discuss the incident and its response. Good writeup https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth, by the way, Matt.

I think we should also reach out to the users that were affected and apologize.

I agree. Can someone please privately provide me a list of affected users so we can work with a community liaison and engineer to communicate out a "sorry" message?

-Adam

Adam Baso

21 Jul

1:20 a.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

...

Hi all, I'm going to schedule some time next week to discuss the incident and its response. Good writeup https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth, by the way, Matt.

Notes posted:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCent...

Matthew Flaschen

13 Jul

5:26 a.m.

New subject: [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

On 07/12/2016 07:07 PM, Greg Grossmeier wrote:

...

Thanks to Matt Flaschen and Brad Jorsch (and others like Ori Livneh and Bryan Davis) for their help.

Also Roan Kattouw, Kunal Mehta, and Stephane Bisson.

Matt

wikitech-l@lists.wikimedia.org