On Thu, Mar 31, 2016 at 12:39 AM, Tim Starling tstarling@wikimedia.org wrote:
I think it's stretching the metaphor to call ops a "tight ship". We could switch off spare servers in codfw for a substantial power saving, in exchange for a ~10 minute penalty in failover time. But it would probably cost a week or two of engineer time to set up suitable automation for failover and periodic updates.
Just a small clarification: I don't think periodically powering servers off and on would be a feasible option, because servers (and computers in general) tend to have a pretty high failure rate when they are power-cycled regularly. We see this with some servers failing every time we do a mass reboot for a security issue. In terms of cost and time spent (and probably also natural resource consumption, though I did no calculation whatsoever), it would probably not be sustainable. On the other hand, we could surely do better in terms of idle-server power consumption.
Or we could have avoided a hot spare colo altogether, with smarter disaster recovery plans, as I argued at the time.
Another small clarification: our codfw datacenter is _not_ just a hot spare for disaster recovery and a lot of work has been done to make the two facilities mostly active-active (and a lot more will be done in the coming year).
Cheers,
Giuseppe
P.S. The server energy footprint of the WMF is negligible compared to the big internet players; even a small-to-medium-sized local ISP probably has a larger footprint than us. This doesn't mean we should not try to get better, but we should always put things in perspective.