---------- Forwarded message ----------
From: Andrew Garrett agarrett@wikimedia.org
Date: 2009/11/16
Subject: [Wikitech-l] Downtime this morning
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Hi all,
There has been some downtime this morning (about 15 minutes) due to a software update.
I pushed a software update, and immediately servers started crashing according to nagios. Looking at ganglia, it looks like the issue was the familiar issue where scap pushes a few 4-CPU apaches into swap, which then crash and come back a few minutes later. This time, however, obviously a key memcached node fell over, causing a database overload, resulting in the site being mostly inaccessible for about ten minutes.
I prepared to revert the software update, but determined that the problem was not the software update, and a scap would exacerbate the issue. The problem resolved itself spontaneously.
We need to fix things up so the scap script is less liable to push machines into swap :)
-- Andrew Garrett agarrett@wikimedia.org http://werdn.us/
_______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
nagios? ganglia? 4-CPU apache? scap? swap? memcached node?
<eyes glazing over>
Is it fixed now? Oh, good. :-)
Carcharoth
On Mon, Nov 16, 2009 at 3:04 PM, David Gerard dgerard@gmail.com wrote:
<snip>
Hehe. Yeah, I'm with Carcharoth. Not sure what any of that meant, Andrew, but it sounds important. Glad it's back up. Thanks for keeping us informed.
Just like there's a certain amount of (anti)prestige associated with being one of the admins who've managed to delete the mainpage http://en.wikipedia.org/w/index.php?title=Special:Log&page=Main_Page is there also a barnstar for being a techie who has unintentionally taken the whole site down? :-)
As Brion says "The internet is burning!"
-Liam [[witty lama]]
wittylama.com/blog Peace, love & metadata
On Tue, Nov 17, 2009 at 2:28 AM, Carcharoth carcharothwp@googlemail.com wrote:
<snip>
Carcharoth carcharothwp@googlemail.com wrote:
nagios? ganglia? 4-CPU apache? scap? swap? memcached node?
<eyes glazing over> Is it fixed now? Oh, good. :-)
Off the top of my head...
"Nagios" is ostensibly the report server and caching manager and "ganglia" IIRC is a page caching manager. "Caching" basically just means keeping wiki pages in RAM so that things get fetched quickly - most people are not logged in so they get the same HTML and reuse the same CSS.
The main concept was that the error not only caused caching servers that were supposed to keep pages in RAM to dump these pages into swap memory, but also affected a main caching node through which other nodes... do stuff... apparently. "Memcached" is the name of the caching software, or rather one of them, and the first one implemented here. IIRC it was first developed for /. (?), and kind of kept WP barely alive through the great traffic growth spurts of 04 and 05.
I looked up "scap" and still dunno what it is.
Again, that's just off the cuff. Don't take anything seriously.
-Stevertigo
stevertigo wrote:
Carcharoth carcharothwp@googlemail.com wrote:
nagios? ganglia? 4-CPU apache? scap? swap? memcached node?
<eyes glazing over> Is it fixed now? Oh, good. :-)
Off the top of my head...
"Nagios" is ostensibly the report server and caching manager and "ganglia" IIRC is a page caching manager.
Actually neither of them are "caching managers" or have any direct role in caching. This isn't the forum to go into a detailed discussion of what they do mean, and a google search would do just as well to fill Carcharoth in, if he was actually interested, which he obviously isn't.
"Caching" basically just means keeping wiki pages in RAM so that things get fetched quickly - most people are not logged in so they get the same HTML and reuse the same CSS.
This is not particularly accurate either.
The main concept was that the error not only caused caching servers that were supposed to keep pages in RAM to dump these pages into swap memory, but also affected a main caching node through which other nodes... do stuff... apparently. "Memcached" is the name of the caching software, or rather one of them, and the first one implemented here. IIRC it was first developed for /. (?),
I think you mean LiveJournal.
and kind of kept WP barely alive through the great traffic growth spurts of 04 and 05.
I looked up "scap" and still dunno what it is.
http://wikitech.wikimedia.org/view/Scap
Liam Wyatt wrote:
Just like there's a certain amount of (anti)prestige associated with being one of the admins who've managed to delete the mainpage http://en.wikipedia.org/w/index.php?title=Special:Log&page=Main_Page is there also a barnstar for being a techie who has unintentionally taken the whole site down? :-)
We don't make a big deal of it. Unlike deleting the main page, crashing the site is an easy mistake to make.
-- Tim Starling
On Tue, Nov 17, 2009 at 7:05 AM, Tim Starling tstarling@wikimedia.org wrote:
stevertigo wrote:
Carcharoth carcharothwp@googlemail.com wrote:
nagios? ganglia? 4-CPU apache? scap? swap? memcached node?
<eyes glazing over> Is it fixed now? Oh, good. :-)
Off the top of my head...
"Nagios" is ostensibly the report server and caching manager and "ganglia" IIRC is a page caching manager.
Actually neither of them are "caching managers" or have any direct role in caching. This isn't the forum to go into a detailed discussion of what they do mean, and a google search would do just as well to fill Carcharoth in, if he was actually interested, which he obviously isn't.
Why, thank you. :-)
Sometimes it is nice to get explanations (understandable ones) from real people, not webpages.
This is why we have a Reference Desk and (possibly) mailing lists.
And to be honest, if I had Googled myself some understanding of this, I may have ended up even more confused about it. If I had asked questions like this on the wikitech-l mailing list, would I have been told to Google the answer? Hey, maybe the Reference Desk *is* the right place for questions like this?
Carcharoth
On Tue, Nov 17, 2009 at 7:25 AM, Carcharoth carcharothwp@googlemail.com wrote:
And to be honest, if I had Googled myself some understanding of this, I may have ended up even more confused about it. If I had asked questions like this on the wikitech-l mailing list, would I have been told to Google the answer?
No, but you didn't. The message was originally posted on wikitech-l, and wasn't meant for wikien-l. It was a brief technical summary of why the site crashed, directed toward people who know about the site architecture and would get useful information from the description. Explaining what all the terms mean wouldn't really serve the purpose of the original message, which was to inform other people knowledgeable about and/or responsible for the site's operation so that the problem could be kept in mind in case it happened again, etc. I don't know what the point was of forwarding it to wikien-l, since it doesn't contain any information that's useful to users.
The technical details are, in something closer to laymen's terms: Andrew updated the software (scapped), and some servers crashed. Looking at various monitoring tools (Nagios/Ganglia), he figured out that some of the older, less powerful (4-CPU) application servers (Apaches) didn't have enough resources to handle the update properly (went into swap). Unfortunately, this somehow (I'm not clear on this part) drove an important caching server (memcached node) to crash also. The reduction in caching caused the database to overload, as requests that normally would have been cached had to go to the database. This made the site mostly inaccessible for about ten minutes, until the caches were repopulated enough to reduce database load to normal levels.
As you can see, this doesn't really contain any info useful to anyone but server admins. Which is why it was originally posted to wikitech-l, not wikien-l.
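A minimal sketch of the cache-miss effect described above, not the actual Wikimedia setup: a plain Python dict stands in for a memcached node and a counter stands in for the database. All names here (`CountingDatabase`, `fetch`, `simulate`) are hypothetical and chosen only for illustration.

```python
class CountingDatabase:
    """Stand-in for the real database; counts how many queries reach it."""

    def __init__(self):
        self.queries = 0

    def load_page(self, title):
        self.queries += 1
        return f"rendered:{title}"


def fetch(title, cache, db):
    """Cache-aside lookup: try the cache first, fall back to the database."""
    if title in cache:
        return cache[title]
    value = db.load_page(title)
    cache[title] = value  # repopulate so the next request is a cache hit
    return value


def simulate(num_pages=100):
    """Return database query counts for four phases of the same workload."""
    db = CountingDatabase()
    cache = {}  # assumption: a dict plays the role of the memcached node
    titles = [f"Page_{i}" for i in range(num_pages)]

    def run_phase():
        before = db.queries
        for title in titles:
            fetch(title, cache, db)
        return db.queries - before

    warm_up = run_phase()      # empty cache: every request hits the database
    steady = run_phase()       # warm cache: no request hits the database
    cache.clear()              # the cache node "falls over"
    after_crash = run_phase()  # every request hits the database again
    recovered = run_phase()    # cache repopulated: load back to normal
    return warm_up, steady, after_crash, recovered
```

Calling `simulate()` returns the number of database queries in each phase: the warm cache sends none to the database, while the requests right after the cache is emptied all reach it, until the cache has been repopulated.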
On Tue, Nov 17, 2009 at 3:07 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
<snip>
As you can see, this doesn't really contain any info useful to anyone but server admins. Which is why it was originally posted to wikitech-l, not wikien-l.
True, but thanks for explaining anyway. Much appreciated, and I do find it interesting, even if my original post in this thread and some of the responses to it said or implied I didn't. I've just been reading the Wikipedia articles on Nagios and Ganglia, and they do help quite a bit in understanding things.
Carcharoth
Carcharoth wrote:
On Tue, Nov 17, 2009 at 3:07 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
<snip>
As you can see, this doesn't really contain any info useful to anyone but server admins. Which is why it was originally posted to wikitech-l, not wikien-l.
True, but thanks for explaining anyway. Much appreciated, and I do find it interesting, even if my original post in this thread and some of the responses to it said or implied I didn't.
The most important thing is to decide what we are going to cross-post to wikitech-l to induce equal bafflement. Something involving 57 different flavours of idiosyncratic interpretation of IAR, and who thinks that they could be safely ignored, might do.
Charles
On Tue, Nov 17, 2009 at 4:11 PM, Charles Matthews charles.r.matthews@ntlworld.com wrote:
<snip>
The most important thing is to decide what we are going to cross-post to wikitech-l to induce equal bafflement. Something involving 57 different flavours of idiosyncratic interpretation of IAR, and who thinks that they could be safely ignored, might do.
I nominate your post on Tottel's Miscellany... :-)
Is it really possible to have 57 different flavours of IAR? Actually, I don't want to know the answer to that one!
Carcharoth
On Tue, Nov 17, 2009 at 11:11 AM, Charles Matthews charles.r.matthews@ntlworld.com wrote:
The most important thing is to decide what we are going to cross-post to wikitech-l to induce equal bafflement. Something involving 57 different flavours of idiosyncratic interpretation of IAR, and who thinks that they could be safely ignored, might do.
The original message was not cross-posted. Andrew posted it on wikitech-l only.
Tim Starling tstarling@wikimedia.org wrote:
Actually neither of them are "caching managers" or have any direct role in caching.
OK. "Various monitoring tools" is sort of sufficient.
Stevertigo wrote:
"Caching" basically just means keeping wiki pages in RAM so that things get fetched quickly - most people are not logged in so they get the same HTML
Tim Starling wrote:
This is not particularly accurate either.
OK. Well (again, just pulling this out of my ear) either it's the wikitext or HTML that is cached, and it didn't seem to me like it made sense to reformat each page as HTML each time it was called. Granted, HTML is bulkier and takes up more RAM (50% more?), and that probably outweighs the load/computensity issue of doing reformatting each time. (Something I didn't consider, as I was pulling... )
So, given what I assume is a rather steepish ratio of casual readers (who need standard pages) to logged in editors (who need customized pages), I just went with the HTML and figured the rest was smoke and mirrors.
I didn't think, though, about how wikitext changes all the time, and thus handling them as HTML would probably add a static element to how pages are refreshed. Hm.
Anyway, sorry if I was less than accurate in my explanation.
I think you mean LiveJournal.
Ah. True.
-Stevertigo "I know I'll keep searching...
Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
The original message was not cross-posted. Andrew posted it on wikitech-l only.
Well, if a 17-minute site crash is "no big deal", as Tim said, then a little casual informative crossposting ain't worth frettin' over, eh?
-Stevertigo "You're hopelessly hopeless...
2009/11/17 Aryeh Gregor Simetrical+wikilist@gmail.com:
As you can see, this doesn't really contain any info useful to anyone but server admins. Which is why it was originally posted to wikitech-l, not wikien-l.
It is of some interest though. http://ganglia.wikimedia.org/ is also of general interest if you are trying to work out if a problem is just you or Wikipedia. It can also be of general interest with regard to traffic patterns (for example, the variation in page views seems to be far less than the variation in image and text uploads).