William Allen Simpson wrote:
Wouldn't want to bother anybody during an outage, as I'm sure that folks are busy. The point of a postmortem is to figure out how to prevent the same from happening in the future.
Since I still don't have a clue what you're talking about, preventing it from happening in the future might be difficult. I'll ask again: what is "local time"? Which local time are you talking about?
Besides, IRC isn't very conducive to planning; an email exchange is much preferable.
There are no other clusters which fill the same role as pmtpa. Go to this page:
For failover, every cluster needs its own copy of the SQL database (slaved), and its own apache servers, and its own squid.
After all, ISP customers aren't calling support because they cannot edit; they're calling because they aren't getting pages served.
and tell me how fast the site would be if every one of those Database::query or memcached::get calls required a couple of transatlantic RTTs. Using centralised caches improves the hit rate, and keeping them within a few kilometres of the apache servers makes the latency acceptable.
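To put rough numbers on it (the per-request call count and the RTT figures below are assumed for illustration, not measured from the live site):

    # Back-of-the-envelope only: the call count and RTTs are assumptions,
    # not measurements of the live site.
    CALLS_PER_PAGEVIEW = 30      # assumed memcached::get + Database::query calls
    LOCAL_RTT = 0.0005           # ~0.5 ms within the same facility
    TRANSATLANTIC_RTT = 0.100    # ~100 ms across the Atlantic

    print("local:  %.0f ms of round trips per page view" % (CALLS_PER_PAGEVIEW * LOCAL_RTT * 1000))
    print("remote: %.0f ms of round trips per page view" % (CALLS_PER_PAGEVIEW * TRANSATLANTIC_RTT * 1000))

Even with an assumed 30 cache and database calls per page view, a transatlantic round trip on each call turns sub-millisecond chatter into seconds of waiting.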
Strawman. Are the Tampa apaches using some sort of memcache shared between them? Then how do the Seoul apaches share that?
Don't give me this "strawman" crap. You've been here two weeks and you think you know the site better than I do? Unless you're willing to treat the existing sysadmin team with the respect it deserves, I'm not interested in dealing with you.
The yaseo apaches serve jawiki, mswiki, thwiki and kowiki. The memcached cluster for those 4 wikis is also located in yaseo. We discussed allowing remote apaches to serve read requests from a local slave database, proxying write requests back to the location of the master database. The problem is that cache writes and invalidations are required even on read requests.
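To make that concrete, here is a minimal sketch in Python (not MediaWiki code; all the names are invented) of why serving a page from a local slave still writes to the shared cache:

    # Minimal sketch (not MediaWiki code) of why even a "read-only" page view
    # performs cache writes: a parser-cache miss triggers a parse and a cache
    # store. All names here are invented for illustration.
    class FakeCache:
        def __init__(self): self.data = {}
        def get(self, key): return self.data.get(key)
        def set(self, key, value): self.data[key] = value

    class FakeSlaveDB:
        def fetch_text(self, title): return "== " + title + " ==\nsome wikitext"

    def parse(wikitext):
        return "<html>" + wikitext + "</html>"   # stand-in for the expensive parse step

    def view_page(title, cache, slave_db):
        key = "parsercache:" + title
        html = cache.get(key)
        if html is None:
            wikitext = slave_db.fetch_text(title)   # read served by a nearby slave DB
            html = parse(wikitext)
            cache.set(key, html)    # a WRITE to the shared cache, on a "read" request
        return html

    print(view_page("Main_Page", FakeCache(), FakeSlaveDB()))

Once apaches in two locations fill and invalidate the same logical cache, you either share one cache over a high-latency link or take on a coherency problem.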
While distributed shared memory systems with cache coherency and asynchronous write operations have been implemented several times, especially in academic circles, I've yet to find one which is suitable for production use in a web application such as MediaWiki. When you take into account that certain kinds of cache invalidation must be synchronised with database writes and squid cache purges, solving the problem of distribution, taken as a whole, would be a significant project.
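As a rough picture of that coupling, here is an illustrative sketch (again in Python, with invented helper names; this is not our actual edit path) of what a single edit has to touch:

    # Sketch only, under assumptions: not MediaWiki's real code, just an
    # illustration of why one edit ties three separate systems together.
    class Stub:
        def __init__(self, name):
            self.name = name
        def __getattr__(self, op):
            return lambda *args: print(self.name + "." + op + repr(args))

    def save_edit(master_db, memcached, squid, title, new_text):
        master_db.write(title, new_text)             # 1. synchronous write to the master DB
        memcached.delete("parsercache:" + title)     # 2. invalidate the object cache entry
        squid.purge("/wiki/" + title)                # 3. purge the page from the squid caches

    save_edit(Stub("db"), Stub("memcached"), Stub("squid"), "Main_Page", "new wikitext")

If either the memcached delete or the squid purge is applied asynchronously at a remote site, readers there keep seeing stale pages, and avoiding that is exactly the coherency problem described above.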
Last year, we discussed the possibility of setting up a second datacentre within the US. But it was clear that centralisation, at least on a per-wiki level, gives the best performance for a given outlay, especially when development time and manageability are taken into account. Of course this performance comes at the expense of reliability. But Domas assured us that it is possible to obtain high availability with a single datacentre, as long as proper attention is paid to internal redundancy.
With the two recent power failures, it's clear that proper attention wasn't paid, but that's another story.
Automatic failover to a read-only mirror would be much easier than true distribution, but I don't think we have the hardware to support such a high request rate, outside of pmtpa.
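By automatic failover I mean something along these lines, as a hypothetical sketch only (the hostnames and helpers are invented; in practice the switch would presumably be the wiki's read-only mode plus a DNS or load balancer change):

    # Hypothetical sketch only: the hostnames and the failover action are
    # invented for illustration, not anything we actually run.
    import socket, time

    PRIMARY = ("primary.example.org", 80)   # assumed: the read-write site (pmtpa)
    MIRROR  = ("mirror.example.org", 80)    # assumed: a read-only mirror elsewhere

    def reachable(host_port, timeout=3):
        try:
            with socket.create_connection(host_port, timeout=timeout):
                return True
        except OSError:
            return False

    def failover_loop():
        while True:
            if not reachable(PRIMARY):
                # In a real setup this is where the wiki would be flipped to
                # read-only and DNS or the load balancer repointed at MIRROR.
                print("primary unreachable: serve read-only from the mirror")
            time.sleep(30)

Even then, the mirror would have to absorb the full read load on its own, which is where the hardware constraint bites.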
In the end it comes down to a trade-off between costs and availability. Given the non-critical nature of our service, and the nature of our funding, I think it's prudent to accept, say, a few hours of downtime once every few months, in exchange for much lower hardware, development and management costs. If PowerMedium can't provide this level of service despite being paid good money, I think we should find a facility that can.
I've noted that ns0, ns1, and ns2 for wikimedia are located far apart, presumably at your clusters. Good practice.
Don't be patronising.
However, that loss of DNS responses from the same subnet leads to the conclusion that the subnet might be under congestive collapse. That is, this lag might not be produced by wikimedia itself, but by a problem with the link to, or within, the facility.
I very much doubt it. Did you try testing for packet loss by pinging a Wikimedia server?
Yes, of course, for most folks that's the first thing to do! (100% loss.) Then, traceroutes from various looking glasses to see whether the problem is path-specific. (Showed a couple of those earlier.)
Again, something caused all the squids and apaches to stop getting bytes and packets in. I saved the ganglia .gifs; would you prefer I send them as attachments?
If the external network was down for 20 minutes then it's PowerMedium's problem. They probably lost a router or something. I have better things to worry about.
You've got a bastion at several clusters; where would the documentation be for what you're running at each?
I've looked at https://wikitech.leuksman.com/view/All_servers, but it's hopelessly sparse (and out of date).
If it's not there then it probably doesn't exist.
-- Tim Starling