[WikiEN-l] Fwd: [Wikitech-l] Downtime this morning

Carcharoth carcharothwp at googlemail.com
Mon Nov 16 15:28:30 UTC 2009


nagios?
ganglia?
4-CPU apache?
scap?
swap?
memcached node?

<eyes glazing over>

Is it fixed now? Oh, good. :-)

Carcharoth

On Mon, Nov 16, 2009 at 3:04 PM, David Gerard <dgerard at gmail.com> wrote:
> ---------- Forwarded message ----------
> From: Andrew Garrett <agarrett at wikimedia.org>
> Date: 2009/11/16
> Subject: [Wikitech-l] Downtime this morning
> To: Wikimedia developers <wikitech-l at lists.wikimedia.org>
>
>
> Hi all,
>
> There has been some downtime this morning (about 15 minutes) due to a
> software update.
>
> I pushed a software update, and immediately servers started crashing
> according to nagios. Judging by ganglia, it looks like the familiar
> issue where scap pushes a few 4-CPU apaches into swap, which then
> crash and come back a few minutes later. This time, however, a key
> memcached node evidently fell over, causing a database overload and
> leaving the site mostly inaccessible for about ten minutes.
>
> I prepared to revert the software update, but determined that the
> update itself was not the problem, and that another scap would
> exacerbate the issue. The problem resolved itself spontaneously.
>
> We need to fix things up so the scap script is less liable to push
> machines into swap :)
>
> --
> Andrew Garrett
> agarrett at wikimedia.org
> http://werdn.us/
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> _______________________________________________
> WikiEN-l mailing list
> WikiEN-l at lists.wikimedia.org
> To unsubscribe from this mailing list, visit:
> https://lists.wikimedia.org/mailman/listinfo/wikien-l
>


