From time to time, I'm sure we've all noticed that Wikipedia slows to a crawl. Last night (local time), for example, for about 15-20 minutes, reading was poor and writing was nearly impossible; see:
April 20, 2006, 10:36 pm http://www.thewritingpot.com/wikistatus/
I tried looking at the site from various views. What struck me was that no matter where I looked from here in the US, east or west or central, all traffic seems to go to Florida, even when the servers are not responding.
No failover to other clusters?
Also, the DNS stopped serving inverse addresses. Compare:
 9  ae-23-54.car3.tampa1.level3.net (4.68.104.107)  222.648 ms  ae-13-55.car3.tampa1.level3.net (4.68.104.139)  221.783 ms  ae-13-53.car3.tampa1.level3.net (4.68.104.75)  223.539 ms
10  level3-co1.tpax.as30217.net (4.71.0.10)  224.125 ms  222.308 ms  223.698 ms
11  e1-1.dr1.tpax.as30217.net (84.40.24.22)  230.567 ms  222.853 ms  227.562 ms
12  gi0-50.csw1-pmtpa.wikimedia.org (64.156.25.242)  222.394 ms  223.082 ms  223.56 ms
13  rr-206.pmtpa.wikimedia.org (207.142.131.206)  225.189 ms  215.542 ms  224.085 ms
11  ae-23-54.car3.Tampa1.Level3.net (4.68.104.107)  51.362 ms  ae-13-51.car3.Tampa1.Level3.net (4.68.104.11)  51.299 ms  ae-13-53.car3.Tampa1.Level3.net (4.68.104.75)  51.291 ms
12  level3-co1.tpax.as30217.net (4.71.0.10)  54.396 ms  53.682 ms  53.826 ms
13  84.40.24.22 (84.40.24.22)  54.127 ms  53.686 ms  53.826 ms
14  gi0-50.csw1-pmtpa.wikimedia.org (64.156.25.242)  59.873 ms  58.579 ms  55.517 ms
15  rr-235.pmtpa.wikimedia.org (207.142.131.235)  53.879 ms  54.104 ms  53.891 ms
That 84.40.24.22 inverse is served by only 2 DNS servers, both located on the same subnet (very bad practice):
;; ANSWER SECTION:
22.24.40.84.in-addr.arpa. 200 IN PTR e1-1.dr1.tpax.as30217.net.

;; AUTHORITY SECTION:
24.40.84.in-addr.arpa. 200 IN NS rns1.powermedium.com.
24.40.84.in-addr.arpa. 200 IN NS rns2.powermedium.com.

;; ADDITIONAL SECTION:
rns1.powermedium.com. 14128 IN A 84.40.24.94
rns2.powermedium.com. 14128 IN A 84.40.24.98
However, that loss of DNS responses from the same subnet leads to the conclusion that the subnet might be under congestive collapse. That is, this lag might not have been produced by Wikimedia itself, but by a problem with the link to, or within, the facility.
Is there any other data that might correspond?
Does anybody have clues or notes on what actually might have been happening at the time? RTG/MRTG?
- - -

Also, I'm seeing incorrect DNS configuration (CNAME to CNAME):

;; ANSWER SECTION:
en.wikipedia.org. 92 IN CNAME rr.wikimedia.org.
rr.wikimedia.org. 600 IN CNAME rr.pmtpa.wikimedia.org.

;; AUTHORITY SECTION:
wikimedia.org. 7200 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2006041914 43200 7200 1209600 3600
Does anybody have clues or notes on what actually might have been happening at the time? RTG/MRTG?
There is some tracking data; see the links from: http://meta.wikimedia.org/wiki/Wikimedia_servers#Status_and_problems
-B.
Those links don't seem to be working.
There is some tracking data; see the links from: http://meta.wikimedia.org/wiki/Wikimedia_servers#Status_and_problems
Platonides wrote:
Those links don't seem to be working.
There is some tracking data; see the links from: http://meta.wikimedia.org/wiki/Wikimedia_servers#Status_and_problems
One link there works: http://ganglia.wikimedia.org/
But that was enough, at least for symptoms. I have no idea what it means. There's no corresponding admin log for 2:30 UTC....
"New Squids", a clear notch (drastic drop of CPU, load, and network traffic, almost to zero).
"Apaches", a corresponding halving of CPU, load, and output (but not input) network traffic.
"MySQL", vast increase (3-5 times) of load, doubling of CPU, but no change in network traffic (that comes a couple of hours later, probably unrelated, the admin log shows a copy of db2 to db3).
The MySQL servers that account for the change are ariel, db1, db2, db3, and db4, with a smaller bump at samuel. The others seem unaffected.
William Allen Simpson wrote:
From time to time, I'm sure we've all noticed that Wikipedia slows to a crawl. Last night (local time), for example, for about 15-20 minutes, reading was poor and writing was nearly impossible; see:
April 20, 2006, 10:36 pm http://www.thewritingpot.com/wikistatus/
What is "local time"? Please state your times in UTC. The page you link to doesn't go back as far as April 20, and it doesn't appear to have any archive links.
In any case, there's not much point in complaining about slow response times a day after the fact. As I told you before, the best place to contribute to this sort of thing is on #wikimedia-tech.
http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/034991.html
I tried looking at the site from various views. What struck me was that no matter where I looked from here in the US, east or west or central, all traffic seems to go to Florida, even when the servers are not responding.
No failover to other clusters?
There are no other clusters which fill the same role as pmtpa. Go to this page:
http://meta.wikimedia.org/wiki/Profiling/20051208
and tell me how fast the site would be if every one of those Database::query or memcached::get calls required a couple of transatlantic RTTs. Using centralised caches improves the hit rate, and keeping them within a few kilometres of the apache servers makes the latency acceptable.
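To put rough, purely illustrative numbers on that (the call count and round-trip times below are assumptions, not measurements):

calls_per_pageview = 50        # assumed number of Database::query / memcached::get round trips
local_rtt = 0.0005             # ~0.5 ms within the same facility (assumption)
transatlantic_rtt = 0.150      # ~150 ms across the Atlantic (assumption)

print("local:  %.0f ms of query latency per page view" % (calls_per_pageview * local_rtt * 1000))
print("remote: %.0f ms of query latency per page view" % (calls_per_pageview * transatlantic_rtt * 1000))
# local:  25 ms; remote: 7500 ms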
Also, the DNS stopped serving inverse addresses. Compare:
[...]
That 84.40.24.22 inverse is served by only 2 DNS servers, both located on the same subnet (very bad practice):
Maybe you should complain to whoever owns those servers.
[...]
However, that loss of DNS responses from the same subnet leads to the conclusion that the subnet might be under congestive collapse. That is, this lag might not have been produced by Wikimedia itself, but by a problem with the link to, or within, the facility.
I very much doubt it. Did you try testing for packet loss by pinging a Wikimedia server?
Is there any other data that might correspond?
Does anybody have clues or notes on what actually might have been happening at the time? RTG/MRTG?
Our MRTG stuff is still down following the loss of larousse, but you can still use these:
http://ganglia.wikimedia.org/
http://tools.wikimedia.de/~leon/stats/reqstats/
https://wikitech.leuksman.com/view/Server_admin_log
-- Tim Starling
Tim Starling wrote:
What is "local time"? Please state your times in UTC. The page you link to doesn't go back as far as April 20, and it doesn't appear to have any archive links.
Sorry, had no idea that site didn't keep an archive.
In any case, there's not much point in complaining about slow response times a day after the fact. As I told you before, the best place to contribute to this sort of thing is on #wikimedia-tech.
http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/034991.html
Wouldn't want to bother anybody during an outage, as I'm sure that folks are busy. The point of a postmortem is to figure out how to prevent the same from happening in the future.
Besides, IRC isn't very conducive to planning; an email exchange is much preferable.
There are no other clusters which fill the same role as pmtpa. Go to this page:
For failover, every cluster needs its own copy of the SQL database (slaved), and its own apache servers, and its own squid.
After all, ISP customers aren't calling support because they cannot edit; it's because they aren't getting pages served.
and tell me how fast the site would be if every one of those Database::query or memcached::get calls required a couple of transatlantic RTTs. Using centralised caches improves the hit rate, and keeping them within a few kilometres of the apache servers makes the latency acceptable.
Strawman. Are the Tampa apaches using some sort of memcache shared between them? Then, how do the Seoul apaches share that?
Also, the DNS stopped serving inverse addresses. Compare:
[...]
That 84.40.24.22 inverse is served by only 2 DNS servers, both located on the same subnet (very bad practice):
Maybe you should complain to whoever owns those servers.
Since they appear to be serving your net, apparently you either own them, or you are paying for them one way or another.
I've noted that the ns0, ns1, and ns2 for wikimedia are located far apart, presumably your clusters. Good practice.
However, that loss of DNS responses from the same subnet leads to the conclusion that the subnet might be under congestive collapse. That is, this lag might not have been produced by Wikimedia itself, but by a problem with the link to, or within, the facility.
I very much doubt it. Did you try testing for packet loss by pinging a Wikimedia server?
Yes, of course, for most folks that's the first thing to do! (100% loss.) Then, traceroutes from various looking glasses to see whether the problem is path specific. (Showed a couple of those earlier.)
Again, something caused all the squid and apaches to stop getting bytes and packets in. I saved the ganglia .gifs, would you prefer I sent them as attachments?
Our MRTG stuff is still down following the loss of larousse, but you can still use these:
http://ganglia.wikimedia.org/
http://tools.wikimedia.de/~leon/stats/reqstats/
https://wikitech.leuksman.com/view/Server_admin_log
That ganglia is RRDTool, which isn't too bad. Would be nice to see the interface byte and packet counts for the switches and upstream routers. That would have told more about the bottleneck, assuming it was a link issue. Could have been something else, but hard to know without data.
In this case, the dip shows up on all clusters, even though it probably only affected Tampa. That's because all measurement is from one place.
Whenever I've set up a POP, I like to have an NTP chimer, MRTG, and a separate DNS instance all running (usually on the same box). That way, even when the main site is down, the others are still running and collecting data. I find that customers may not like the fact that the mail servers are down, but as long as they can still fetch data from elsewhere, they're less likely to be completely unhappy.
You've got a bastion at several clusters, where would the documentation be for what you're running at each?
I've looked at https://wikitech.leuksman.com/view/All_servers, but it's hopelessly sparse (and out of date).
William Allen Simpson wrote:
Wouldn't want to bother anybody during an outage, as I'm sure that folks are busy. The point of a postmortem is to figure out how to prevent the same from happening in the future.
Since I still don't have a clue what you're talking about, preventing it from happening in the future might be difficult. I'll ask again: what is "local time"? Which local time are you talking about?
Besides, IRC isn't very conducive to planning; an email exchange is much preferable.
There are no other clusters which fill the same role as pmtpa. Go to this page:
For failover, every cluster needs its own copy of the SQL database (slaved), and its own apache servers, and its own squid.
After all, ISP customers aren't calling support because they cannot edit; it's because they aren't getting pages served.
and tell me how fast the site would be if every one of those Database::query or memcached::get calls required a couple of transatlantic RTTs. Using centralised caches improves the hit rate, and keeping them within a few kilometres of the apache servers makes the latency acceptable.
Strawman. Are the Tampa apaches using some sort of memcache shared between them? Then, how do the Seoul apaches share that?
Don't give me this "strawman" crap. You've been here 2 weeks and you think you know the site better than I do? Unless you're willing to treat the existing sysadmin team with the respect it deserves, I'm not interested in dealing with you.
The yaseo apaches serve jawiki, mswiki, thwiki and kowiki. The memcached cluster for those 4 wikis is also located in yaseo. We discussed allowing remote apaches to serve read requests from a local slave database, proxying write requests back to the location of the master database. The problem is that cache writes and invalidations are required even on read requests.
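A toy read-through cache in Python makes the point (the names here are illustrative only, not MediaWiki's actual code): even a plain page read writes to the shared cache on a miss, and every edit must invalidate it, which is what makes splitting the apaches and memcached across continents hard.

cache = {}                              # stands in for memcached
db = {"Main_Page": "wikitext ..."}      # stands in for the master database

def get_page(title):
    if title in cache:                  # hit: no write needed
        return cache[title]
    text = db[title]                    # miss: read from the database...
    cache[title] = text                 # ...and write the result back into the cache
    return text

def edit_page(title, text):
    db[title] = text
    cache.pop(title, None)              # the invalidation must reach every cache copy

print(get_page("Main_Page"))            # the very first read already performs a cache write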
While distributed shared memory systems with cache coherency and asynchronous write operations have been implemented several times, especially in academic circles, I'm yet to find one which is suitable for production use in a web application such as MediaWiki. When you take into account that certain kinds of cache invalidation must be synchronised with database writes and squid cache purges, the problem of distribution, taken as a whole, would be a significant project.
Last year, we discussed the possibility of setting up a second datacentre within the US. But it was clear that centralisation, at least on a per-wiki level, gives the best performance for a given outlay, especially when development time and manageability are taken into account. Of course this performance comes at the expense of reliability. But Domas assured us that it is possible to obtain high availability with a single datacentre, as long as proper attention is paid to internal redundancy.
With the two recent power failures, it's clear that proper attention wasn't paid, but that's another story.
Automatic failover to a read-only mirror would be much easier than true distribution, but I don't think we have the hardware to support such a high request rate, outside of pmtpa.
In the end it comes down to a trade-off between costs and availability. Given the non-critical nature of our service, and the nature of our funding, I think it's prudent to accept, say, a few hours of downtime once every few months, in exchange for much lower hardware, development and management costs. If PowerMedium can't provide this level of service despite being paid good money, I think we should find a facility that can.
I've noted that the ns0, ns1, and ns2 for wikimedia are located far apart, presumably your clusters. Good practice.
Don't be patronising.
However, that loss of DNS responses from the same subnet leads to the conclusion that the subnet might be under congestive collapse. That is, this lag might not have been produced by Wikimedia itself, but by a problem with the link to, or within, the facility.
I very much doubt it. Did you try testing for packet loss by pinging a Wikimedia server?
Yes, of course, for most folks that's the first thing to do! (100% loss.) Then, traceroutes from various looking glasses to see whether the problem is path specific. (Showed a couple of those earlier.)
Again, something caused all the squid and apaches to stop getting bytes and packets in. I saved the ganglia .gifs, would you prefer I sent them as attachments?
If the external network was down for 20 minutes then it's PowerMedium's problem. They probably lost a router or something. I have better things to worry about.
You've got a bastion at several clusters, where would the documentation be for what you're running at each?
I've looked at https://wikitech.leuksman.com/view/All_servers, but it's hopelessly sparse (and out of date).
If it's not there then it probably doesn't exist.
-- Tim Starling
Tim Starling wrote:
William Allen Simpson wrote:
Wouldn't want to bother anybody during an outage, as I'm sure that folks are busy. The point of a postmortem is to figure out how to prevent the same from happening in the future.
Since I still don't have a clue what you're talking about, preventing it from happening in the future might be difficult. I'll ask again: what is "local time"? Which local time are you talking about?
Since the thread hasn't had the words "local time" for some time, it was hard to figure out your query. Going back to my first message, there is a "local time" in parentheses. In context, it is clear that "last night (local time)" relates to your data center. That is, local night.
Had you looked at the graphs (or other logs) at the time, or anytime in the following day (all I would hope), the data was obvious. Since you didn't, I've attached some of the dozen or so that I saved.
The time was 02:30+ UTC.
The load, CPU, and network dropped off the Squids, and the Apaches (network incoming stayed the same, outgoing dropped), at the same time that SQL load and CPU leaped (network showed mild incoming decrease and outgoing increase, the inverse of the apaches). Told you exactly which servers.
No corresponding note in the admin log. Perhaps somebody remembers doing something unusual at the time?
Don't give me this "strawman" crap. You've been here 2 weeks and you think you know the site better than I do? Unless you're willing to treat the existing sysadmin team with the respect it deserves, I'm not interested in dealing with you.
Never said that I did. That's why I've been asking questions. The level of site documentation is execrable.
However, it would be nicer for you to treat folks offering help with the respect *they* deserve. After all, I do happen to have 30+ years of experience in the field, organized the state government funding for NSFnet (the academic precursor to the Internet) 20 years ago, was an original member of the North American Network Operators Group (NANOG), have written a fair few Internet standards over the years, among other things.
http://www.google.com/search?q=%22William+Allen+Simpson%22
The yaseo apaches serve jawiki, mswiki, thwiki and kowiki. The memcached cluster for those 4 wikis is also located in yaseo. We discussed allowing remote apaches to serve read requests from a local slave database, proxying write requests back to the location of the master database. The problem is that cache writes and invalidations are required even on read requests.
Yes, this is obvious and well-known. They are just caches, improvements in local efficiency.
While distributed shared memory systems with cache coherency and asynchronous write operations have been implemented several times, especially in academic circles, I'm yet to find one which is suitable for production use in a web application such as MediaWiki. When you take into account that certain kinds of cache invalidation must be synchronised with database writes and squid cache purges, the problem of distribution, taken as a whole, would be a significant project.
Amazingly, I happen to be sitting just 1 1/2 blocks from one of those "academic circles", the Center for Information Technology Integration of the University of Michigan in Ann Arbor, Michigan.
Last year, we discussed the possibility of setting up a second datacentre within the US. But it was clear that centralisation, at least on a per-wiki level, gives the best performance for a given outlay, especially when development time and manageability are taken into account. Of course this performance comes at the expense of reliability. But Domas assured us that it is possible to obtain high availability with a single datacentre, as long as proper attention is paid to internal redundancy.
Yes, faster, cheaper, better; pick two (as the old saying goes).
Not knowing "Domas" (or whether that's a name or a company), I'm not sure of the basis for the assurance. Had you checked with other sites, I'm pretty sure you'd have heard that reliability from a single data center is extremely unlikely.
With the two recent power failures, it's clear that proper attention wasn't paid, but that's another story.
No, that's the same old story. It's practically guaranteed.
Automatic failover to a read-only mirror would be much easier than true distribution, but I don't think we have the hardware to support such a high request rate, outside of pmtpa.
Agreed. So, it's probably time to think about fixing that problem.
In the end it comes down to a trade-off between costs and availability. Given the non-critical nature of our service, and the nature of our funding, I think it's prudent to accept, say, a few hours of downtime once every few months, in exchange for much lower hardware, development and management costs. If PowerMedium can't provide this level of service despite being paid good money, I think we should find a facility that can.
Agreed.
I've noted that the ns0, ns1, and ns2 for wikimedia are located far apart, presumably your clusters. Good practice.
Don't be patronising.
So, when I'm asking critical questions, I'm not giving you the respect you deserve, but by giving you an "attaboy", I'm patronizing?
Sounds like somebody is lacking some social graces.
I'll just note in passing that the current documentation lists
https://wikitech.leuksman.com/view/DNS
* ns0.wikimedia.org - 207.142.131.207 (secondary IP on zwinger)
* ns1.wikimedia.org - 207.142.131.208 (larousse)
* ns2.wikimedia.org - 145.97.39.158 (secondary IP on pascal)
You know, that bad practice of having 2 on the same subnet, mentioned a couple of messages back.... So, the note was supposed to be encouragement, notwithstanding that the documentation is wrong. The only reason I know that it's been improved is by a bit of archeology with dig.
If the external network was down for 20 minutes then it's PowerMedium's problem. They probably lost a router or something. I have better things to worry about.
The external network losses correspond to huge peaks in the MySQL graphs. So, I doubt you have better things to worry about -- that appears to be congestive collapse caused by something happening within your servers.
Even the loss of a router or switch or link is of concern, especially coupled with other problems such as the loss of power. Not knowing your SLA, it may be a refund is due.
Anyway, I thought a postmortem was in order.... Professionals do that kind of thing.
William Allen Simpson william.allen.simpson@gmail.com writes:
However, it would be nicer for you to treat folks offering help with the respect *they* deserve. After all, I do happen to have 30+ years of experience in the field, organized the state government funding for NSFnet (the academic precursor to the Internet) 20 years ago, was an original member of the North American Network Operators Group (NANOG), have written a fair few Internet standards over the years, among other things.
With all due respect for your qualifications, you lack the prime skill needed: Know who and what you are dealing with, before trying to change things.
...
Not knowing "Domas" (or whether that's a name or a company), I'm not sure of the basis for the assurance. Had you checked with other sites, I'm pretty sure you'd have heard that reliability from a single data center is extremely unlikely.
Not knowing domas does not exactly qualify you for preferential treatment.
On 23 Apr 2006 17:02:14 +0200, Anders Wegge Jakobsen wegge@wegge.dk wrote:
Not knowing "Domas" (or whether that's a name or a company), I'm not sure of the basis for the assurance. Had you checked with other sites, I'm pretty sure you'd have heard that reliability from a single data center is extremely unlikely.
Not knowing domas does not exactly qualify you for preferential treatment.
All right, let's sort that out, then. "Domas" is Domas Mituzas, one of our core operations staff; I'd describe him as our resident performance nut, and one of our main database admins. He handles a lot of hardware purchasing and is one of the group of people who beat the site back into submission/working on a routine basis. Of note is the fact that he works for MySQL AB, although I confess I don't recall his precise position.
No doubt I left bits of the description out; feel free to amend it. :)
Rob Church
Anders Wegge Jakobsen wrote:
With all due respect for your qualifications, you lack the prime skill needed: Know who and what you are dealing with, before trying to change things.
So far, I've not asked for a specific change. I've been asking for pointers to documentation and monitoring, and posting my observations.
The other items you mentioned are more appropriate to a gang, kinda like a bad movie: "You don't know who you're dealing with, kid."
Not knowing domas does not exactly qualify you for preferential treatment.
That's out of line. Nobody has asked for preferential treatment. I came willing to spend my copious amounts of free time reading documentation, looking at logs, and trying to figure out solutions. Instead, I got a lecture on "respect".
There are some mighty thin skins around here....
Rob Church wrote:
... No doubt I left bits of the description out; feel free to amend it. :)
Thank you, Rob. It doesn't explain the 3rd hand "high reliability" conclusion, but helps keep previous comments in perspective.
William Allen Simpson wrote:
Tim Starling wrote:
William Allen Simpson wrote:
Wouldn't want to bother anybody during an outage, as I'm sure that folks are busy. The point of a postmortem is to figure out how to prevent the same from happening in the future.
Since I still don't have a clue what you're talking about, preventing it from happening in the future might be difficult. I'll ask again: what is "local time"? Which local time are you talking about?
Since the thread hasn't had the words "local time" for some time, it was hard to figure out your query. Going back to my first message, there is a "local time" in parentheses. In context, it is clear that "last night (local time)" relates to your data center. That is, local night.
Meet the sysadmin team (with corresponding timezones)
Brion: PDT -7
Kyle: EDT -4
Kate: BST +1
JeLuF: CEDT +2
Mark: CEDT +2
Domas: usually EET +2, currently travelling somewhere
Me: AEST +10
The only person who lives near the servers is Kyle, and he was hired as a hardware tech for that exact reason. The servers all have their clocks set to UTC, and we use UTC in logs and communications.
Had you looked at the graphs (or other logs) at the time, or anytime in the following day (all I would hope), the data was obvious. Since you didn't, I've attached some of the dozen or so that I saved.
The time was 02:30+ UTC.
Ah, well in that case, I can tell you exactly what happened. The hero of the day was Zsinj, a canny newbie who had his eye on the relevant monitoring graphs, and alerted us to the problem immediately, using very specific terms, allowing us to track down and fix it rapidly.
Log extract from #wikimedia-tech follows, times are UTC+10.
[12:32] <Zsinj> Just had a SQL spike and squids almost flatlined. What happened?
[12:34] <TimStarling> Zsinj: had?
[12:34] <Interiot> network is slow maybe?
[12:34] <TimStarling> oh, continuing
[12:34] <Zsinj> SQL loads adb2, db4, and ariel all skyrocket load averages
[12:35] <Zsinj> db2, db4, and ariel*
[12:39] <Zsinj> hm, something caused sq1 to really cut its load, that's what increased the demand on the SQL servers.
[12:39] <DragonFire2410> Wikinews not responding
[12:39] * mboverload stabs himself because someone thought it was a good idea to put the datacenter in Flordia
[12:39] <Zsinj> what's wrong with florida?
[12:40] <Zsinj> TimStarling: Perhaps something with sq1?
At about this time I identified the problem as being a flood of Special:Listusers type queries. Fearing a deliberate DoS attack, I invited Zsinj into the private channel, so that he could watch while we fixed the problem that he described so well and so promptly. My immediate response was to disable the special page and kill the queries. While I was doing that, I had time to type a few lines into IRC. Kate (consanguinity) was also there.
[12:44] * Zsinj (n=chatzill@node230-67.unnamed.db.erau.edu) has joined #secretchannel
[12:44] <TimStarling> invited Zsinj
[12:44] <Zsinj> Hello everyone.
[12:45] <TimStarling> I'm a bit busy at the moment
[12:46] <TimStarling> basically we had a flood of requests for Special:Listusers with a limit of 500
[12:46] <Zsinj> Interesting.
[12:46] <TimStarling> it could have been a deliberate DoS attack
[12:46] <Zsinj> it certainly has the symptoms of one.
[12:46] <consanguinity> was it the same page? more likely someone trying to list all the users, if not
[12:47] <Zsinj> or a rogue bot request?
[12:47] <consanguinity> assume stupidity before malice ;-)
[12:47] <TimStarling> well yeah, these things have almost always turned out to be accidents
[12:47] <TimStarling> but I thought I'd better switch channel just in case
[12:48] <Zsinj> Good precaution. :)
[12:48] <TimStarling> I disabled that special page altogether
[12:48] <TimStarling> then I killed the queries, still in progress
[12:48] <Zsinj> squids and apaches are on the upswing and SQL load is decreasing.
[12:49] <consanguinity> is listusers still using limit,offset or was it slow for another reason?
Some pasting of queries and thinking out loud followed. I worked out the problem and applied a temporary fix at 13:03 UTC+10. I summarised it in a later conversation with Domas on #wikimedia-tech, who among other things is our local DB expert:
[00:04] <Wegge> So it's around 5 or 6 AM for you?
[00:04] <dammit> 7AM
[00:11] * dammit is now known as domas
[00:17] <TimStarling> hi domas
[00:17] <domas> hey Tim
[00:17] <TimStarling> we had some stuff going on earlier today with GROUP BY that you might be interested in
[00:17] <TimStarling> well, not today in your timezone obviously
[00:18] <TimStarling> mysql> SELECT 'Listusers' as type, 2 AS namespace, user_name AS title, user_name as value, user_id, COUNT(ug_group) as numgroups FROM `user` LEFT JOIN `user_groups` ON user_id=ug_user GROUP BY user_name, user_id ORDER BY value LIMIT 50,50;
[00:18] <TimStarling> 50 rows in set (32.94 sec)
[00:18] <TimStarling> mysql> SELECT 'Listusers' as type, 2 AS namespace, user_name AS title, user_name as value, user_id, COUNT(ug_group) as numgroups FROM `user` LEFT JOIN `user_groups` ON user_id=ug_user GROUP BY user_name ORDER BY value LIMIT 50,50;
[00:18] <TimStarling> 50 rows in set (0.01 sec)
[00:18] <TimStarling> the first one groups by two columns
[00:19] <TimStarling> it was done in the name of PostgreSQL compatibility: http://mail.wikipedia.org/pipermail/mediawiki-cvs/2006-April/014586.html
[00:19] <domas> riiiiight
[00:19] <domas> probably it's better to use min/max for constants
[00:19] <domas> rather than putting them to GROUP BY
[00:19] <domas> this was mistake I was doing few years ago too
[00:22] <TimStarling> well, I reverted the extra fields in GROUP BY, there were a few of them in that patch
[00:25] <TimStarling> http://mail.wikipedia.org/pipermail/mediawiki-cvs/2006-April/014764.html
-- Tim Starling
[00:18] <TimStarling> mysql> SELECT 'Listusers' as type, 2 AS namespace, user_name AS title, user_name as value, user_id, COUNT(ug_group) as numgroups FROM `user` LEFT JOIN `user_groups` ON user_id=ug_user GROUP BY user_name, user_id ORDER BY value LIMIT 50,50;
[00:18] <TimStarling> 50 rows in set (32.94 sec)
[00:18] <TimStarling> mysql> SELECT 'Listusers' as type, 2 AS namespace, user_name AS title, user_name as value, user_id, COUNT(ug_group) as numgroups FROM `user` LEFT JOIN `user_groups` ON user_id=ug_user GROUP BY user_name ORDER BY value LIMIT 50,50;
[00:18] <TimStarling> 50 rows in set (0.01 sec)
Some general thoughts about this while it's on my mind: the key here to minimising the impact of this kind of problem is isolation, rather than distribution. We already have good isolation for search, and improving isolation for images -- if one of those two services goes offline then the rest should stay up, unaffected. Maybe it's time we introduced a "basic" query group, containing those queries required for page views. Then we could send all "basic" queries to a dedicated cluster, and all other queries to a second isolated cluster. Then as long as we can keep the apache thread count low enough, any problem with those diverse special page queries would not affect page view performance.
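A minimal sketch of the routing, with made-up names and server lists (this is not MediaWiki's actual configuration, just the shape of the idea):

DB_CLUSTERS = {
    "basic": ["db1", "db2"],    # only the queries needed for plain page views
    "misc":  ["db3", "db4"],    # special pages, reports, other expensive scans
}

def pick_server(group="misc"):
    # anything not explicitly marked "basic" goes to the isolated cluster,
    # so a flood of Special:Listusers-style queries can't starve page views
    servers = DB_CLUSTERS.get(group, DB_CLUSTERS["misc"])
    return servers[0]           # real code would load-balance and health-check

print(pick_server("basic"))     # page view        -> db1
print(pick_server())            # special page etc -> db3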
We could go even further and split the apache cluster into an "ordinary page view" cluster and an "everything else" cluster. This would mitigate DoS attacks on apache resources.
Any comments?
-- Tim Starling
Tim Starling wrote:
Some general thoughts about this while it's on my mind: the key here to minimising the impact of this kind of problem is isolation, rather than distribution. We already have good isolation for search, and improving isolation for images -- if one of those two services goes offline then the rest should stay up, unaffected.
As background for those not familiar, here's the situation with search:
The actual search work is performed by a daemon using the Lucene search library. This is running on three servers, which our main PHP application servers can contact over HTTP internally. If the HTTP request is rejected or times out (and the timeout is obscenely short), the PHP side tries a couple more servers, until it either finds one that works or runs out and gives up.
So if the search servers are all overloaded or down, you just get a nice little error message and are offered the chance to use an external search (google/yahoo/etc). No immediate gratification, but the site stays up.
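In pseudocode terms, the failover looks roughly like this (a sketch only; the hostnames, port and timeout value are assumptions, and the real implementation is PHP inside MediaWiki):

import urllib.request

SEARCH_SERVERS = ["search1", "search2", "search3"]   # hypothetical hostnames
TIMEOUT = 0.25                                       # the "obscenely short" timeout, in seconds

def do_search(query):
    for host in SEARCH_SERVERS:
        url = "http://%s:8123/search?q=%s" % (host, query)   # port is an assumption
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT) as response:
                return response.read()               # first server that answers wins
        except OSError:
            continue                                 # rejected or timed out: try the next one
    return None    # all backends down: show the error page and offer an external search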
When we first tried this system, the timeout and failover weren't yet in place -- if the daemon encountered certain exceptions or got overloaded, it would leave connections hanging for a long time. All the available threads would fill up on the apaches; a hundred PHP processes just waiting on their search results... *kaboom*
The image fileserver is currently a potential problem, as the application servers use NFS to manipulate files on it. NFS is notoriously temperamental, and if the server goes down it tends to hang for long periods of time, with similar problem results.
Improvements to this could include minimizing our contact with the file server (avoid unnecessary reads and checks for file existence; we've got a damn database) and potentially using some more explicit file upload protocol which can fail gracefully.
Maybe it's time we introduced a "basic" query group, containing those queries required for page views. Then we could send all "basic" queries to a dedicated cluster, and all other queries to a second isolated cluster. Then as long as we can keep the apache thread count low enough, any problem with those diverse special page queries would not affect page view performance.
Probably wise.
We could go even further and split the apache cluster into an "ordinary page view" cluster and an "everything else" cluster. This would mitigate DoS attacks on apache resources.
Slightly less trivial, but probably doable.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
The image fileserver is currently a potential problem, as the application servers use NFS to manipulate files on it. NFS is notoriously temperamental, and if the server goes down it tends to hang for long periods of time, with similar problem results.
Well, I'll stop by and ask the fellow in charge of this for the Linux kernel, possibly tomorrow, or maybe 10am donuts on Weds.
Improvements to this could include minimizing our contact with the file server (avoid unnecessary reads and checks for file existence; we've got a damn database) and potentially using some more explicit file upload protocol which can fail gracefully.
Both sound like good ideas.
On Sun, Apr 23, 2006 at 03:50:30PM -0400, William Allen Simpson wrote:
Brion Vibber wrote:
The image fileserver is currently a potential problem, as the application servers use NFS to manipulate files on it. NFS is notoriously temperamental, and if the server goes down it tends to hang for long periods of time, with similar problem results.
Well, I'll stop by and ask the fellow in charge of this for the Linux kernel, possibly tomorrow, or maybe 10am donuts on Weds.
Showing off is such bad form, William. :-)
Cheers, -- jra
Brion Vibber wrote:
The image fileserver is currently a potential problem, as the application servers use NFS to manipulate files on it. NFS is notoriously temperamental, and if the server goes down it tends to hang for long periods of time, with similar problem results.
Improvements to this could include minimizing our contact with the file server (avoid unnecessary reads and checks for file existence; we've got a damn database) and potentially using some more explicit file upload protocol which can fail gracefully.
One fairly simple thing to do would be to reduce the NFS timeout substantially. Currently we use a timeout of 1.4 seconds, then it backs off exponentially for a total of 1.4 + 2.8 + 5.6 = 9.8 seconds. If I'm reading this right, that's in addition to the RPC timeout, whatever that is. I don't know what amane's typical response time is at peak load, but I suspect it's orders of magnitude less than that.
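For reference, the same arithmetic spelled out (assuming the usual NFS behaviour of doubling the timeout on each retransmission):

timeo = 1.4      # initial timeout in seconds
retrans = 3      # number of attempts before the request finally fails

total = sum(timeo * 2 ** i for i in range(retrans))
print("%.1f seconds" % total)    # 1.4 + 2.8 + 5.6 = 9.8 seconds spent on a single failed request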
The structural problem with NFS is that a timeout is required on every request. There's no global state, so every attempted read incurs the same timeout penalty. I believe this is not a problem with AFS:
http://www.openafs.org/pages/doc/UserGuide/auusg004.htm#HDRWQ17
We do have the same problem with MediaWiki's MySQL, memcached and search access, but at least we have straightforward application-level control over timeouts and retries.
-- Tim Starling
On Mon, Apr 24, 2006 at 02:44:54PM +1000, Tim Starling wrote:
The structural problem with NFS is that a timeout is required on every request. There's no global state, so every attempted read incurs the same timeout penalty. I believe this is not a problem with AFS:
http://www.openafs.org/pages/doc/UserGuide/auusg004.htm#HDRWQ17
Has anyone inspected what the potential gains and losses might be to moving to something like AFS, GFS, Coda, etc?
Cheers, -- jra
Tim Starling wrote:
The only person who lives near the servers is Kyle, and he was hired as a hardware tech for that exact reason. The servers all have their clocks set to UTC, and we use UTC in logs and communications.
That would be the standard practice, but....
As noted in the message itself, I used the time listed in the log on the external monitoring site, http://www.thewritingpot.com/wikistatus/ garnered from the recent message of garion1000@gmail.com, and named the subject line accordingly. Later, I also included the words:
# There's no corresponding admin log for 2:30 UTC....
Sorry that you found the subject line misleading, I'll try to do better in the future....
William Allen Simpson wrote:
The time was 02:30+ UTC.
Ah, well in that case, I can tell you exactly what happened. The hero of the day was Zsinj, a canny newbie who had his eye on the relevant monitoring graphs, and alerted us to the problem immediately, using very specific terms, allowing us to track down and fix it rapidly.
Log extract from #wikimedia-tech follows, times are UTC+10.
Ah, another place that doesn't use UTC logs....
Anyway, thank you for following up. I still don't understand how the SQL spike affected network performance, and have not yet found the switch and router graph with IP subnet assignments.
And thanks to my questions, more of us know where to find the graphs!
William Allen Simpson wrote:
Anyway, thank you for following up. I still don't understand how the SQL spike affected network performance, and have not yet found the switch and router graph with IP subnet assignments.
Congested database -> slower responses to web requests -> less traffic served.
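A rough back-of-the-envelope version of that chain (the worker count and response times below are invented for illustration): with a fixed pool of worker threads, throughput is roughly workers divided by response time, so slower database responses directly mean fewer requests served, which is the traffic drop visible on the graphs.

workers = 100                          # assumed number of apache worker threads

for response_time in (0.2, 5.0):       # normal vs. congested database, seconds per request
    throughput = workers / response_time
    print("%.1f s/request -> about %.0f requests/s" % (response_time, throughput))
# 0.2 s/request -> about 500 requests/s
# 5.0 s/request -> about 20 requests/s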
-- brion vibber (brion @ pobox.com)
William Allen Simpson-2 wrote:
Tim Starling wrote:
Log extract from #wikimedia-tech follows, times are UTC+10.
Ah, another place that doesn't use UTC logs....
That's an IRC channel, so the log would be client-side, so the TZ would perforce be client-based.
HTH HAND
William Allen Simpson wrote:
That 84.40.24.22 inverse is served by only 2 DNS servers, both located on the same subnet (very bad practice):
[snip]
I wouldn't worry so much about reverse DNS, since we are only using it for cosmetic purposes...
Also, I'm seeing incorrect DNS configuration (CNAME to CNAME):

;; ANSWER SECTION:
en.wikipedia.org. 92 IN CNAME rr.wikimedia.org.
rr.wikimedia.org. 600 IN CNAME rr.pmtpa.wikimedia.org.
CNAME chaining is not incorrect, just "discouraged". And yes, we are fully aware of the consequences of this setup.