Tim Starling wrote:
What is "local time"? Please state your times in UTC. The page you link to doesn't go back as far as April 20, and it doesn't appear to have any archive links.
Sorry, had no idea that site didn't keep an archive.
> In any case, there's not much point in complaining about slow response times a day after the fact. As I told you before, the best place to contribute to this sort of thing is on #wikimedia-tech.
> http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/034991.html
Wouldn't want to bother anybody during an outage, as I'm sure folks are busy. The point of a postmortem is to figure out how to prevent the same thing from happening in the future.
Besides, IRC isn't very conducive to planning; an email exchange is much preferable.
> There are no other clusters which fill the same role as pmtpa. Go to this page:
For failover, every cluster needs its own copy of the SQL database (slaved), and its own apache servers, and its own squid.
After all, ISP customers aren't calling support because they cannot edit; it's because they aren't getting pages served.
> and tell me how fast the site would be if every one of those Database::query or memcached::get calls required a couple of transatlantic RTTs. Using centralised caches improves the hit rate, and keeping them within a few kilometres of the apache servers makes the latency acceptable.
Straw man. Are the Tampa apaches using some sort of memcached shared between them? Then, how do the Seoul apaches share that?
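To put rough numbers on the RTT point, a back-of-the-envelope sketch (the per-page call count and RTT figures are assumptions for illustration, not measurements):

    # Rough latency model: sequential DB/memcached round trips per page render.
    # All numbers below are assumptions for illustration, not measurements.
    rtt_local_ms = 0.5            # apache -> memcached/DB in the same facility
    rtt_transatlantic_ms = 100.0  # e.g. Tampa <-> Europe round trip
    calls_per_page = 50           # Database::query + memcached::get per render

    local = calls_per_page * rtt_local_ms
    remote = calls_per_page * rtt_transatlantic_ms
    print(f"local caches:  {local:7.1f} ms per page")   # ~25 ms
    print(f"remote caches: {remote:7.1f} ms per page")  # ~5000 ms

On those assumptions the difference is two orders of magnitude, which is exactly the tension here: either the remote apaches keep their own warm caches (hurting the shared hit rate), or they pay that latency on every call.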
Also, the DNS stopped serving inverse addresses. Compare:
[...]
That 84.40.24.22 inverse is served by only 2 DNS servers, both located on the same subnet (very bad practice).
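For the record, the check is easy to reproduce against each listed nameserver directly, so it doesn't depend on any one recursive resolver. A sketch using the dnspython package (the nameserver IPs are placeholders; substitute the two actual servers for that zone):

    # Query the PTR record for 84.40.24.22 against each authoritative
    # nameserver in turn. Requires the dnspython package; the IPs in
    # NAMESERVERS are placeholders.
    import dns.resolver
    import dns.reversename

    TARGET = "84.40.24.22"
    NAMESERVERS = ["192.0.2.1", "192.0.2.2"]  # substitute the zone's NSes

    qname = dns.reversename.from_address(TARGET)  # 22.24.40.84.in-addr.arpa.
    for ns in NAMESERVERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ns]
        resolver.lifetime = 3.0  # give up quickly on a dead server
        try:
            answer = resolver.resolve(qname, "PTR")
            print(ns, "->", ", ".join(str(r) for r in answer))
        except Exception as exc:
            print(ns, "-> FAILED:", exc)

If both fail at once, that's consistent with the whole subnet being unreachable rather than a single dead daemon.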
> Maybe you should complain to whoever owns those servers.
Since they appear to be serving your net, apparently you either own them or are paying for them one way or another.
I've noted that ns0, ns1, and ns2 for wikimedia are located far apart, presumably at your clusters. Good practice.
However, that loss of DNS responses from the same subnet leads to the conclusion that the subnet might be under congestive collapse. That is, this lag might not have been produced by wikimedia itself, but by a problem with the link to, or within, the facility.
> I very much doubt it. Did you try testing for packet loss by pinging a Wikimedia server?
Yes, of course, for most folks that's the first thing to do! (100% loss.) Then, traceroutes from various looking glasses to see whether the problem is path-specific. (Showed a couple of those earlier.)
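For anyone repeating the test, a quick sketch that measures loss from the local vantage point (it wraps the system ping; the host list is illustrative):

    # Measure packet loss to a few targets by wrapping the system ping(8).
    # Host list is illustrative; run from several vantage points, since loss
    # that only shows on some paths implicates the network, not the site.
    import re
    import subprocess

    HOSTS = ["en.wikipedia.org", "ns0.wikimedia.org"]  # illustrative targets

    for host in HOSTS:
        out = subprocess.run(["ping", "-c", "10", "-q", host],
                             capture_output=True, text=True).stdout
        m = re.search(r"([\d.]+)% packet loss", out)
        print(host, "->", m.group(1) + "% loss" if m else "no reply")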
Again, something caused all the squids and apaches to stop getting bytes and packets in. I saved the ganglia .gifs; would you prefer I send them as attachments?
> Our MRTG stuff is still down following the loss of larousse, but you can still use these:
> http://ganglia.wikimedia.org/
> http://tools.wikimedia.de/~leon/stats/reqstats/
> https://wikitech.leuksman.com/view/Server_admin_log
That ganglia is RRDTool-based, which isn't too bad. It would be nice to see the interface byte and packet counts for the switches and upstream routers. That would have told more about the bottleneck, assuming it was a link issue. Could have been something else, but hard to know without data.
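What MRTG actually does there is simple enough to improvise while it's down. A sketch of the same poll, wrapping net-snmp's snmpget (hostname, community string, and interface index are placeholders):

    # MRTG-style rate sample: read ifInOctets twice via net-snmp's snmpget
    # and convert the counter delta to bytes/sec. Host, community, and
    # interface index are placeholders.
    import subprocess
    import time

    HOST, COMMUNITY, IFINDEX = "switch1.example.net", "public", "2"

    def if_in_octets() -> int:
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST,
             f"IF-MIB::ifInOctets.{IFINDEX}"],
            capture_output=True, text=True, check=True).stdout
        return int(out.strip())

    first = if_in_octets()
    time.sleep(60)                 # MRTG polls every five minutes; shortened
    second = if_in_octets()
    print((second - first) / 60, "bytes/sec in")  # ignores 32-bit counter wrap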
> In this case, the dip shows up on all clusters, even though it probably only affected Tampa. That's because all measurement is from one place.
Whenever I've set up a POP, I like to have an NTP chimer, MRTG, and a separate DNS instance all running (usually on the same box). That way, even when the main site is down, the others are still running and collecting data. I find that customers may not like the fact that the mail servers are down, but as long as they can still fetch data from elsewhere, they're less likely to be completely unhappy.
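A minimal version of that "keep collecting even when the main site is down" idea, meant to run on the standalone box (the URL, interval, and log path are placeholders):

    # Out-of-band availability log: run on the separate monitoring box so
    # the record survives an outage of the main site. URL, interval, and
    # log path are placeholders.
    import time
    import urllib.request

    URL = "http://en.wikipedia.org/"
    INTERVAL = 60  # seconds between probes

    while True:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        try:
            start = time.monotonic()
            with urllib.request.urlopen(URL, timeout=10) as resp:
                line = f"{stamp} {resp.status} {time.monotonic() - start:.2f}s"
        except Exception as exc:
            line = f"{stamp} DOWN {exc}"
        with open("uptime.log", "a") as log:  # append-only survives restarts
            log.write(line + "\n")
        time.sleep(INTERVAL)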
You've got a bastion at several clusters; where would the documentation be for what you're running at each?
I've looked at https://wikitech.leuksman.com/view/All_servers, but it's hopelessly sparse (and out of date).