Domas has said that in his opinion, we can meet our availability goals
even if we expand the Florida colo, rather than putting servers
elsewhere in the US. If this is indeed the case (and Domas has more
experience than any of us with this sort of thing) then I'd be in favour
of this course of action. It's not clear that we can achieve good
performance without a central database/Apache cluster serving the
English Wikipedia.
We can put other wikis in other colos. The small size of the subsidiary
colos will limit their usefulness as a redundant backup if the Florida
colo goes down, but that's OK. We have to consider our goals. I think
that in the worst-case scenario of an extended failure of the Florida
colo, it is reasonable to expect:
* Read-only service after a few hours, using the other colos
* Limited (i.e. slow) read/write service a few hours after that
* Permanent data loss on the order of a few transactions
* Full read/write service after repair or replacement
I don't think it's worth doubling our hardware or accepting a permanent
performance reduction in order to reduce downtime in this unlikely case.
However, it is certainly sensible to analyse possible single points of
failure in the main colo, and to take steps to eliminate them.
Currently the main vulnerable points appear to be network and power, and
we've experienced problems with both in the last 6 months. I'm not an
expert on either, but I'm assured that securing these services to a
reasonable degree is possible.
Note that eliminating SPOFs is not sufficient to guarantee service. If
we have 5 DB servers running at 95% capacity, we can hardly expect to
lose one of them without performance problems. There are similar
problems in the upstream network -- two weeks ago we temporarily had
extremely high packet loss rates, probably due to a link or router
failure in the backbone network. The solution to this is obvious but
expensive -- buy spares or add capacity.
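
To put rough numbers on the DB example above, here is a minimal sketch of
the arithmetic. It assumes identical servers, load that redistributes
evenly across the survivors, and total load that stays constant after the
failure. The function names and the `ceiling` parameter are made up for
illustration and do not describe any existing tooling.

```python
import math


def utilisation_after_failure(servers: int, utilisation: float, failed: int = 1) -> float:
    """Per-server utilisation once `failed` servers drop out.

    Assumes identical servers and an even spread of the unchanged total load.
    """
    remaining = servers - failed
    if remaining <= 0:
        raise ValueError("no servers left to carry the load")
    return servers * utilisation / remaining


def servers_needed(servers: int, utilisation: float, failed: int = 1,
                   ceiling: float = 1.0) -> int:
    """Smallest pool size that keeps survivors at or below `ceiling`
    utilisation after `failed` servers are lost (hypothetical sizing rule)."""
    total_load = servers * utilisation  # in units of one server's capacity
    return math.ceil(total_load / ceiling) + failed


if __name__ == "__main__":
    # The case from the text: 5 DB servers at 95% capacity, one lost.
    print(utilisation_after_failure(5, 0.95))
    print(servers_needed(5, 0.95))
```

With the figures from the text, the four surviving servers would each need
to run at about 119% of capacity, which is why losing one hurts. Under the
same assumptions, a sixth server is the minimum needed to survive a single
failure, and that only with the survivors running at 95% again.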
-- Tim Starling