On Thu, Mar 31, 2016 at 12:39 AM, Tim Starling <tstarling(a)wikimedia.org> wrote:
> I think it's stretching the metaphor to call ops a "tight ship". We
> could switch off spare servers in codfw for a substantial power
> saving, in exchange for a ~10 minute penalty in failover time. But it
> would probably cost a week or two of engineer time to set up suitable
> automation for failover and periodic updates.
Just a small clarification: I don't think periodically turning servers
off and on would be a feasible option, because servers (and computers
in general) tend to have a pretty high failure rate when they are
powered off and on regularly. We see this every time we do a mass
reboot because of a security issue: some server always fails. In terms
of cost and time spent (and probably also natural-resource
consumption, though I did no calculation whatsoever), regularly
power-cycling servers would probably not be sustainable. On the other
hand, we could surely do better on idle-server power consumption.
> Or we could have avoided a hot spare colo altogether, with smarter
> disaster recovery plans, as I argued at the time.
Another small clarification: our codfw datacenter is _not_ just a hot
spare for disaster recovery. A lot of work has been done to make the
two facilities mostly active-active, and a lot more will be done in
the coming year.
Cheers,
Giuseppe
P.S. The WMF's server energy footprint is negligible compared to that
of the big internet players; even a small to medium-sized local ISP
probably has a larger footprint than we do. This doesn't mean we
shouldn't try to do better, but we should always keep things in
perspective.
--
Giuseppe Lavagetto
Senior Technical Operations Engineer, Wikimedia Foundation