hi,
we had an outage of ~5 hours affecting all login servers, from 04:15 UTC to around 09:30 UTC. the issue was fixed fairly quickly once it was noticed, but it took some time to reach an admin.
the outage was caused by a misconfiguration on the cluster that prevented the NFS service from being switched from a failed node to a working one. the misconfiguration has been fixed, so failover should work correctly in the future. (this was our first unscheduled cluster failover, so the failover path hadn't been tested before.)
a full description of the outage is available in MNT-56.
- river.
toolserver-l@lists.wikimedia.org