Gerrit was down today


Greg Grossmeier

7 Oct 2016

3:01 a.m.

(It wasn't just you)

Gerrit was down today starting around 17:49 UTC. It is now back up and services are coming back online.

A full investigation into the cause of the outage is still ongoing.[0]

Apologies for the downtime.

WMF Release Engineering

[0] https://etherpad.wikimedia.org/p/gerrit-outage-20161006 But this is missing a lot of the information/discussion that is happening in #wikimedia-operations on Freenode. A link to the incident report will be pasted into that etherpad when it is created.

--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| Release Team Manager            A18D 1138 8E47 FAC8 1C7D |


Chad Horohoe

7 Oct

4:03 a.m.

New subject: [Engineering] Gerrit was down today

Hi!

Sorry for the extended downtime! From what we can tell, it appears as though the machine that Gerrit is running on (lead) is having some hardware issues that are making the CPU misbehave. We've worked around it for now, so things should be up (and Zuul is processing CI events just fine).

However, since it appears it's a hardware problem, we're planning to migrate off of lead to a new machine (cobalt). The public IP addresses will not be changing. The plan right now is to do this migration tomorrow with a scheduled downtime at 17:00 UTC (10:00 PDT).

We'll be keeping a close eye on things in the meantime, so if things deteriorate again we can start the migration sooner.

(and yeah, wikitech incident report to follow, I'm a little burnt out right now though)

Thanks again for bearing with us!

-Chad

Engineering mailing list Engineering@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/engineering

Gergo Tisza

4:26 a.m.

New subject: [Engineering] Gerrit was down today

Thanks a lot for the quick recovery!

Would it be possible to use something other than a redirect next time when traffic needs to be blocked? An apache deny rule or a 404 would work, but a redirect means that reloading the page (or reopening the browser) will cause the URL to be lost with little hope of recovery (browsers don't record redirects in the history). That can be very annoying when one uses tabs as bookmarks (bad habit as it is).


Amir Ladsgroup

4:43 a.m.

New subject: [Engineering] Gerrit was down today

It bothered me too, but I'm guessing this is one of the many flaws of Gerrit itself and probably not easily fixable (other people are more qualified to comment). I want to suggest speeding up the move to Differential, which handles such downtime much better, alongside other benefits.

Best

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Chad Horohoe

5:58 a.m.

New subject: [Engineering] Gerrit was down today

This is actually how we have Apache configured to respond to Gerrit being unavailable - that error page is served with a 503 when Gerrit is really down.

Today I hacked it to always show that page, so even when it was "up" people wouldn't be hitting it -- we were still debugging and restarting things so I didn't want to give false hopes or end up with half-completed transactions.

This can all be improved I think with some Apache config tweaks.

-Chad
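
A minimal sketch of the behaviour discussed above: answering every request with a 503 at the requested URL itself, so nothing is redirected and a reload retries the same address. This is written in Python purely for illustration and is not the Apache configuration used on the Gerrit host; the page body, the `Retry-After` value, and all names below are assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical maintenance page and retry hint; not taken from the real setup.
MAINTENANCE_BODY = b"<h1>Gerrit is temporarily unavailable</h1>\n"
RETRY_AFTER_SECONDS = 600


def maintenance_response():
    """Build (status, headers, body) for any request during maintenance.

    The response carries no Location header, so the browser stays on the
    URL it asked for instead of being sent to a separate error page.
    """
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(MAINTENANCE_BODY)),
        "Retry-After": str(RETRY_AFTER_SECONDS),
    }
    return 503, headers, MAINTENANCE_BODY


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Serve the maintenance page in place for every GET request."""

    def do_GET(self):
        status, headers, body = maintenance_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def run(port=8080):
    """Start the sketch server on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), MaintenanceHandler).serve_forever()
```

Calling `run()` and opening any path on `http://127.0.0.1:8080/` should show the maintenance page while the address bar keeps the original URL, which is the property Gergo asked for.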


Roan Kattouw

8 Oct

1:24 a.m.

Looks like it's down again? I was going to ask on IRC, but due to netsplits (caused by freenode maintenance), IRCCloud is down too.

(IRC and Gerrit both down... clearly I should just go to lunch now :) )


Amir Ladsgroup

1:27 a.m.

Chad wrote: However, since it appears it's a hardware problem, we're planning to migrate off of lead to a new machine (cobalt). The public IP addresses will not be changing. The plan right now is to do this migration tomorrow with a scheduled downtime at 17:00 UTC (10:00 PDT).

TLDR: scheduled down time.

Best


Chad Horohoe

1:35 a.m.

New subject: [Engineering] Gerrit was down today

Yeah, we're working on a migration right now. It didn't go as smoothly as I would have hoped.

Also Freenode is netsplitting which is not very helpful right now :(

Everything will be back soon!

-Chad


Daniel Zahn

2:37 a.m.

New subject: [Engineering] Gerrit was down today

The Gerrit migration is over. It is back up and served from new server "cobalt" now. It feels faster than before as well. Thanks much to Brandon Black for help.

Chad Horohoe

12 Oct

10:29 p.m.

New subject: [Engineering] Gerrit was down today

Heya!

Gonna reboot Gerrit real quick this morning. Turns out "cobalt" did not have hyperthreading turned on. Services should be back momentarily!

-Chad

