Ray Saintonge wrote:
Tim Starling wrote:
We can put other wikis in other colos. The small size of the subsidiary
colos will limit their usefulness as a redundant backup if the Florida
colo goes down, but that's OK. We have to consider our goals. I think in
the worst case scenario of an extended failure of the Florida colo, a
reasonable goal is to expect:
* Read-only service after a few hours, using the other colos
* Limited (i.e. slow) read/write service a few hours after that
* Permanent data loss on the order of a few transactions
* Full read/write service after repair or replacement
Currently the main vulnerable points appear to be network and power, and
we've experienced problems with both in the last 6 months. I'm not an
expert on either, but I'm assured securing these services to a
reasonable degree is possible.
I would suggest that any new major backup colo in North America
should be in a different power grid area. This would protect from the
possible consequences of a major blackout such as the one in 2003. As
I understand the situation, North America has two major power grids, one
in the East and one in the West. Only Texas and Quebec have independent
grids.
I was thinking more along the lines of battery backup to tide us over
until the generator comes online, and a redundant network uplink to
guard against switch failure. As I tried to explain, a second US colo
would lead to reduced performance, unless it was within 100km of the
current colo.
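
To put a rough number on the distance point, here is a back-of-envelope sketch. It assumes the performance cost comes from network round trips between the two sites, that signals travel through fibre at about 200 km per millisecond (roughly two thirds of the speed of light), and that a request needs a hypothetical 20 sequential round trips. None of these figures are measurements of our setup, and the function names are made up for illustration.

```python
# Assumed propagation speed in optical fibre: about 200 km per millisecond.
FIBRE_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip propagation delay, ignoring routing and queuing."""
    return 2 * distance_km / FIBRE_KM_PER_MS


def request_delay_ms(distance_km: float, round_trips: int) -> float:
    """Delay added to one request that makes sequential remote round trips."""
    return round_trips * round_trip_ms(distance_km)


if __name__ == "__main__":
    # 20 round trips per request is a hypothetical figure, not a measurement.
    for km in (100, 1000, 4000):
        print(km, "km:", round_trip_ms(km), "ms per round trip,",
              request_delay_ms(km, 20), "ms per request")
```

Under these assumptions a colo 100km away adds about 1 ms per round trip, which is negligible, while one on the other side of the continent adds tens of milliseconds per round trip, and that adds up quickly when the round trips are sequential.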
Putting half your hardware on one grid and half on another means you
lose half your capacity if the power goes off. There is no need for
this. There is already a diesel generator on site; we just need to cover
various kinds of short-term failure. We've seen two short-term failures:
a main circuit breaker trip and a power strip circuit breaker trip. The
power strip failure could have been prevented by having a proper PDU
with independent breakers; I believe one is now on order. Various
threats to the main power, including the main circuit breaker, can be
dealt with by supplying the DB servers with a UPS, and negotiating with
the colo to ensure that their supply is fully redundant. The main power
failure apparently only lasted for a matter of seconds. If they don't
intend to guard against such short failures, then we need to make our
own arrangements.
-- Tim Starling