Hello all,
the hole toolserver-cluster was away between (circa) 19:12 and 20:30 UTC. The main-problem was that nfs stoped working and so no login, no web-apps and no command-line-command could have run. It looks that nfs stoped working because DNS-resolving stoped working (and AFAIS DNS stoped working because of low memory or a network-problem). The roots will investigate where the problem exactly was.
Just as information.
Sincerly, DaB.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
DaB.:
the hole toolserver-cluster was away between (circa) 19:12 and 20:30 UTC. The main-problem was that nfs stoped working and so no login, no web-apps and no command-line-command could have run. It looks that nfs stoped working because DNS-resolving stoped working (and AFAIS DNS stoped working because of low memory or a network-problem). The roots will investigate where the problem exactly was.
Hi,
This seems to have been some kind of kernel memory leak on turnera (one of the cluster nodes), although it's not clear what caused it yet. Since the userland was only very slow (rather than completely down) the cluster didn't notice the problem and fail over until someone rebooted the host manually.
I will be investigating exactly which service caused the problem.
- river.
toolserver-announce@lists.wikimedia.org