Re: [WikiEN-l] Fwd: [Wikitech-l] Downtime this morning

16 Nov 2009


nagios?
ganglia?
4-CPU apache?
scap?
swap?
memcached node?
<eyes glazing over>
Is it fixed now? Oh, good. :-)
Carcharoth
On Mon, Nov 16, 2009 at 3:04 PM, David Gerard dgerard@gmail.com wrote:
...
---------- Forwarded message ----------
From: Andrew Garrett agarrett@wikimedia.org
Date: 2009/11/16
Subject: [Wikitech-l] Downtime this morning
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Hi all,
There has been some downtime this morning (about 15 minutes) due to a
software update.
I pushed a software update, and immediately servers started crashing
according to nagios. Looking at ganglia, it looks like the issue was
the familiar issue where scap pushes a few 4-CPU apaches into swap,
which then crash and come back a few minutes later. This time,
however, obviously a key memcached node fell over, causing a database
overload, resulting in the site being mostly inaccessible for about
ten minutes.
I prepared to revert the software update, but determined that the
problem was not the software update, and a scap would exacerbate the
issue. The problem resolved itself spontaneously.
We need to fix things up so the scap script is less liable to push
machines into swap :)
--
Andrew Garrett
agarrett@wikimedia.org
http://werdn.us/

Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

WikiEN-l mailing list
WikiEN-l@lists.wikimedia.org
To unsubscribe from this mailing list, visit:
https://lists.wikimedia.org/mailman/listinfo/wikien-l
