Tim Starling wrote:
William Allen Simpson wrote:
Wouldn't want to bother anybody during an outage, as I'm sure that folks are busy. The point of a postmortem is to figure out how to prevent the same from happening in the future.
Since I still don't have a clue what you're talking about, preventing it from happening in the future might be difficult. I'll ask again: what is "local time"? Which local time are you talking about?
Since the thread hadn't contained the words "local time" for some time, it was hard to figure out what you were asking. Going back to my first message, there is a "local time" in parentheses. In context, it is clear that "last night (local time)" relates to your data center. That is, local night.
Had you looked at the graphs (or other logs) at the time, or anytime in the following day (as I would hope), the data was obvious. Since you didn't, I've attached some of the dozen or so that I saved.
The time was 02:30+ UTC.
The load, CPU, and network dropped off the Squids, and the Apaches (network incoming stayed the same, outgoing dropped), at the same time that SQL load and CPU leaped (network showed mild incoming decrease and outgoing increase, the inverse of the apaches). Told you exactly which servers.
No corresponding note in the admin log. Perhaps somebody remembers doing something unusual at the time?
Don't give me this "strawman" crap. You've been here 2 weeks and you think you know the site better than I do? Unless you're willing to treat the existing sysadmin team with respect it deserves, I'm not interested in dealing with you.
Never said that I did. That's why I've been asking questions. The level of site documentation is execrable.
However, it would be nicer for you to treat folks offering help with the respect *they* deserve. After all, I do happen to have 30+ years of experience in the field, organized the state government funding for NSFnet (the academic precursor to the Internet) 20 years ago, was an original member of the North American Network Operators Group (NANOG), have written a fair few Internet standards over the years, among other things.
http://www.google.com/search?q=%22William+Allen+Simpson%22
The yaseo apaches serve jawiki, mswiki, thwiki and kowiki. The memcached cluster for those 4 wikis is also located in yaseo. We discussed allowing remote apaches to serve read requests from a local slave database, proxying write requests back to the location of the master database. The problem is that cache writes and invalidations are required even on read requests.
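The point about cache writes on read requests is the standard cache-aside pattern: a cache miss during a page view forces a write back into the shared cache, so even a "read-only" remote apache would have to write to the wiki's (possibly remote) memcached cluster. A minimal sketch, with plain dicts standing in for memcached and the database slave:

```python
# Hypothetical sketch of the cache-aside pattern described above; the
# dicts stand in for the memcached cluster and a local database slave.
cache = {}
database = {"Main_Page": "wikitext of Main_Page"}

def get_page(title):
    """Read path: a cache miss forces a cache *write*, not just a read."""
    value = cache.get(title)
    if value is None:
        value = database[title]   # read from the local slave
        cache[title] = value      # write back to the shared cache
    return value

get_page("Main_Page")
assert "Main_Page" in cache       # the read request wrote to the cache
```

This is why serving reads locally while proxying only writes doesn't cleanly separate the two traffic classes.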
Yes, this is obvious and well-known. They are just caches, improvements in local efficiency.
While distributed shared memory systems with cache coherency and asynchronous write operations have been implemented several times, especially in academic circles, I've yet to find one which is suitable for production use in a web application such as MediaWiki. When you take into account that certain kinds of cache invalidation must be synchronised with database writes and squid cache purges, the problem of distribution, taken as a whole, would be a significant project.
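The synchronisation requirement described above can be sketched as a single write path that must be kept in step; names here are illustrative, not MediaWiki's actual code. Distributing the cache means distributing this whole sequence, not just the reads:

```python
# Illustrative sketch of the write path: the database write, the memcached
# invalidation and the squid purge have to happen together, which is what
# makes asynchronous distribution hard.
purged = []  # stands in for purge requests sent to the squid caches

def save_page(title, new_text, db, cache):
    db[title] = new_text        # 1. commit the edit to the master DB
    cache.pop(title, None)      # 2. invalidate the stale memcached entry
    purged.append(title)        # 3. purge the page from the squid caches

db = {"Main_Page": "old text"}
cache = {"Main_Page": "old text"}
save_page("Main_Page", "new text", db, cache)
```

If any step runs asynchronously, a reader in another datacentre can see a stale cache entry for an already-committed edit.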
Amazingly, I happen to be sitting just 1 1/2 blocks from one of those "academic circles", the Center for Information Technology Integration of the University of Michigan in Ann Arbor, Michigan.
Last year, we discussed the possibility of setting up a second datacentre within the US. But it was clear that centralisation, at least on a per-wiki level, gives the best performance for a given outlay, especially when development time and manageability are taken into account. Of course this performance comes at the expense of reliability. But Domas assured us that it is possible to obtain high availability with a single datacentre, as long as proper attention is paid to internal redundancy.
Yes, faster, cheaper, better; pick two (as the old saying goes).
Not knowing "Domas" (or whether that's a name or a company), I'm not sure of the basis for the assurance. Had you checked with other sites, I'm pretty sure you'd have heard that reliability from a single data center is extremely unlikely.
With the two recent power failures, it's clear that proper attention wasn't paid, but that's another story.
No, that's the same old story. It's practically guaranteed.
Automatic failover to a read-only mirror would be much easier than true distribution, but I don't think we have the hardware to support such a high request rate, outside of pmtpa.
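The read-only mirror idea above is simple precisely because writes are refused outright rather than replicated in both directions. A minimal sketch, with names that are illustrative rather than Wikimedia's actual setup:

```python
class FailoverStore:
    """Sketch of automatic failover to a read-only mirror: when the
    primary site is down, reads are served from the mirror and writes
    are refused, which is far simpler than keeping two writable sites
    coherent. (Hypothetical names; not the actual Wikimedia setup.)"""

    def __init__(self, primary, mirror):
        self.primary = primary      # authoritative copy (e.g. pmtpa)
        self.mirror = mirror        # read-only replica elsewhere
        self.primary_up = True

    def read(self, key):
        source = self.primary if self.primary_up else self.mirror
        return source[key]

    def write(self, key, value):
        if not self.primary_up:
            raise RuntimeError("site is read-only: primary unavailable")
        self.primary[key] = value

store = FailoverStore(primary={"Main_Page": "text"},
                      mirror={"Main_Page": "text"})
store.primary_up = False            # simulate losing the primary site
assert store.read("Main_Page") == "text"   # reads survive the outage
```

The catch, as noted, is that the mirror still has to absorb the full read rate, which takes hardware that only pmtpa currently has.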
Agreed. So, it's probably time to think about fixing that problem.
In the end it comes down to a trade-off between costs and availability. Given the non-critical nature of our service, and the nature of our funding, I think it's prudent to accept, say, a few hours of downtime once every few months, in exchange for much lower hardware, development and management costs. If PowerMedium can't provide this level of service despite being paid good money, I think we should find a facility that can.
Agreed.
I've noted that the ns0, ns1, and ns2 for wikimedia are located far apart, presumably your clusters. Good practice.
Don't be patronising.
So, when I'm asking critical questions, I'm not giving you the respect you deserve, but by giving you an "attaboy", I'm patronizing?
Sounds like somebody is lacking some social graces.
I'll just note in passing that the current documentation (https://wikitech.leuksman.com/view/DNS) lists:

* ns0.wikimedia.org - 207.142.131.207 (secondary IP on zwinger)
* ns1.wikimedia.org - 207.142.131.208 (larousse)
* ns2.wikimedia.org - 145.97.39.158 (secondary IP on pascal)
You know, that bad practice of having 2 on the same subnet, mentioned a couple of messages back.... So, the note was supposed to be encouragement, notwithstanding that the documentation is wrong. The only reason I know that it's been improved is by a bit of archeology with dig.
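The shared-subnet problem is easy to check mechanically. A sketch using the (admittedly outdated) addresses from the documentation quoted above:

```python
import ipaddress

# The outdated addresses from the documentation quoted above.
nameservers = {
    "ns0.wikimedia.org": "207.142.131.207",
    "ns1.wikimedia.org": "207.142.131.208",
    "ns2.wikimedia.org": "145.97.39.158",
}

# Group the nameservers by /24 subnet: any group with more than one
# entry exhibits the shared-subnet problem mentioned above.
subnets = {}
for name, addr in nameservers.items():
    net = ipaddress.ip_network(addr + "/24", strict=False)
    subnets.setdefault(net, []).append(name)

for net, names in subnets.items():
    if len(names) > 1:
        print(net, "is shared by", names)
```

Run against the live NS records (e.g. via dig) instead of the stale documentation, the same grouping shows whether the problem has actually been fixed.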
If the external network was down for 20 minutes then it's PowerMedium's problem. They probably lost a router or something. I have better things to worry about.
The external network losses correspond to huge peaks in the MySQL graphs. So, I doubt you have better things to worry about -- that appears to be congestive collapse caused by something happening within your servers.
Even the loss of a router or switch or link is of concern, especially coupled with other problems such as the loss of power. Not knowing your SLA, it may be a refund is due.
Anyway, I thought a postmortem was in order.... Professionals do that kind of thing.