Hello all,
during the maintenance window yesterday evening the hole cluster was down for ~30min starting ~21:20 UTC. The problem was independent of the maintenance working, but caused the window to extend. The problem was an out-of-memory on one of our HA-nodes. Unfortunately the box did not restart itself and its ha-buddy did not detect the problem too, so the services of the out-of-memory-box were not switched to the other box. This caused the hole cluster to stand until I manually rebooted the host. I will look if I can find some kind of sensor for that; in worst case I will enable our old "reboot if low on memory"-script again.
Sincerely, DaB.
toolserver-announce@lists.wikimedia.org