Hi,
At about 22:30 UTC last night (Tuesday) one of our power circuits went down for about 15 minutes. This affected one node of the HA cluster which was hosting the following services:
  * Sun Grid Engine master server
  * tsbot IRC bot
  * DNS recursor
  * MySQL server for sql-toolserver
  * MySQL replication support infrastructure
  * LDAP server
All services failed over to the other node and were online again within 22 seconds. However, MySQL did not respond well to losing its replication connection and had to be restarted manually, causing about 30 minutes of replication lag.
- river.
River Tarnell wrote:
> All services failed over to the other node and were online again within 22 seconds. However, MySQL did not respond well to losing its replication connection and had to be restarted manually, causing about 30 minutes of replication lag.
This is so much better than what would have happened last year that it's actually a reason to celebrate. Thanks for the good work, river!
-- daniel
This raises an interesting question: what made the server go down in the first place? Surely the power to the server would be 1+1 (i.e., dual power supplies attached to separate power circuits, powered by separate UPS and generator grids respectively)?
This kind of redundancy is expected in data centers nowadays, and I assume that all the TS servers are in a data center. I'm just curious as to why this obviously isn't the case here.
-Brett
-----Original Message-----
From: toolserver-l-bounces@lists.wikimedia.org [mailto:toolserver-l-bounces@lists.wikimedia.org] On Behalf Of River Tarnell
Sent: Wednesday, 30 June 2010 9:24 AM
To: Wikimedia Toolserver Announcements
Cc: Wikimedia Toolserver Discussion
Subject: [Toolserver-l] Power outage
Brett Hillebrand wrote:
> Raises an interesting question, what made the server go down in the first place? Surely the power to the server would be 1+1?
The two servers which form the HA cluster do not have redundant power, because power (at least in Amsterdam) is very expensive, and we did not consider it worth the cost to avoid 22 seconds of downtime in the fairly rare event of a power failure.
All other servers in the affected rack, which are not in an HA configuration, have redundant PSUs and were not affected by the outage.
- river.