On Jun 7, 2004, at 9:08 AM, Ulrich Fuchs wrote:
Furthermore, our problem shouldn't be the hardware any more. When we haven't been able to get a simple second database server working for half a year now, it's not the hardware. The server wasn't brought down by too much traffic. Either we bought the wrong hardware, or it's badly configured, or we have the wrong software because it doesn't scale. Then we should think about software, not about hardware. The bottleneck is not the hardware. Sorry to say this, but there was enough money to get this thing running. Probably we spent it the wrong way.
Maybe the money was spent in the right way, but contingency planning failed us. Unlike in most of my private-sector work on failure recovery planning, here I haven't seen a clear set of plans, goals, triggers, and timelines for bringing up cold spares, for deciding how much data loss is acceptable, for mirroring data for fastest recovery, for adding new hardware, and so on.
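To make that concrete, here is a minimal sketch of the kind of written, reviewable plan I mean, with explicit recovery-time and data-loss targets per failure scenario. All names, numbers, and steps below are invented for illustration, not a description of our actual setup:

    # Hypothetical sketch: a recovery plan expressed as data, so its
    # targets can be reviewed and rehearsed rather than assumed.
    from dataclasses import dataclass, field

    @dataclass
    class RecoveryPlan:
        scenario: str          # what failed (e.g. "primary DB host lost")
        rto_minutes: int       # max acceptable time until service is restored
        rpo_minutes: int       # max acceptable window of lost data
        trigger: str           # who or what decides to invoke this plan
        steps: list = field(default_factory=list)  # pre-agreed recovery actions

    plans = [
        RecoveryPlan(
            scenario="primary database server destroyed",
            rto_minutes=15,
            rpo_minutes=5,
            trigger="on-call admin, no further sign-off needed",
            steps=[
                "promote warm replica to primary",
                "repoint application servers at the replica",
                "order replacement hardware",
            ],
        ),
    ]

    for p in plans:
        print(f"{p.scenario}: restore within {p.rto_minutes} min, "
              f"lose at most {p.rpo_minutes} min of edits")

The point isn't the format; it's that the targets, the trigger, and the ordered steps are decided and written down before the fire, not during it.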
If suda (literally) caught on fire, was there an understood, written plan for recovery? What was considered the acceptable downtime? Was the scenario tested? What software/hardware systems exist to handle rack fires? How about explosions at data centers? 300% surges in traffic over 24 hours? What is the planned, formal command chain for decision making during a crisis? Have all the decisions been made already, so the command chain is not a problem? Has a much more expensive 15 minutes of total downtime per catastrophic event been budgeted for, rather than a much cheaper 16 hours per event?
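As a back-of-the-envelope illustration of that last question (event frequency is an assumption I'm inventing here), the two budgets differ by about a factor of 64 in downtime per event, which is what you're buying with the more expensive plan:

    # Rough, invented numbers: compare two downtime budgets per catastrophic event.
    events_per_year = 2                      # assumed catastrophic events per year
    minutes_per_year = 365 * 24 * 60

    for label, downtime_min in [("15-minute budget", 15), ("16-hour budget", 16 * 60)]:
        yearly_downtime = downtime_min * events_per_year
        availability = 1 - yearly_downtime / minutes_per_year
        print(f"{label}: {yearly_downtime} min down/year, "
              f"~{availability:.4%} availability")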
Maybe I'm wrong, and this multi-hour outage is exactly the planned-for and expected result of recent events, and once we started having problems, the policies and procedures kicked in. I kind of doubt it, though.
-Bop