nagios? ganglia? 4-CPU apache? scap? swap? memcached node?
<eyes glazing over>
Is it fixed now? Oh, good. :-)
Carcharoth
On Mon, Nov 16, 2009 at 3:04 PM, David Gerard dgerard@gmail.com wrote:
---------- Forwarded message ----------
From: Andrew Garrett agarrett@wikimedia.org
Date: 2009/11/16
Subject: [Wikitech-l] Downtime this morning
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Hi all,
There has been some downtime this morning (about 15 minutes) due to a software update.
I pushed a software update, and immediately servers started crashing according to nagios. Looking at ganglia, it looks like this was the familiar problem where scap pushes a few 4-CPU apaches into swap, which then crash and come back a few minutes later. This time, however, obviously a key memcached node fell over, causing a database overload, resulting in the site being mostly inaccessible for about ten minutes.
I prepared to revert the software update, but determined that the problem was not the software update itself, and that another scap would exacerbate the issue. The problem resolved itself spontaneously.
We need to fix things up so the scap script is less liable to push machines into swap :)
--
Andrew Garrett
agarrett@wikimedia.org
http://werdn.us/