Austin Hair wrote:
However, we are reading a few bits off of zwinger's NFS (some block lists etc., some lock files) and sometimes writing (logs). Insofar as those are currently used, they should either be migrated to a more survivable setup or be able to fail gracefully. If NFS isn't already set up to fail cleanly after a short timeout, it should be.
The Linux mount option "soft" will cause an I/O error to be returned after a "major timeout," the definition of which varies. "intr" in combination with "hard" will allow the program to respond to signals, which is in most cases preferable to having an uninterruptible process sitting there until reboot.
We mount NFS with soft and timeo=14 (timeo is in tenths of a second, so 1.4 seconds). I imagine retrans is at its default value of 3, so if I understand the manual correctly, that gives a major timeout of 9.8 seconds (1.4 + 2.8 + 5.6, with the timeout doubling after each retry). That would be consistent with what we saw in the crash -- most apps don't seem to abort when they get one of these timeouts, they just treat it as an ordinary read error and continue their execution. It's not surprising that everything locked up, including root logins.
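For anyone who wants to check the arithmetic with other settings, here is a rough sketch. It assumes the behaviour I read in the manual for NFS over UDP: the first wait is timeo tenths of a second, each later wait doubles, and there is a 60-second ceiling per wait. The function name and the cap parameter are made up for illustration, and the kernel's actual accounting may differ (for example over TCP, or in whether the initial attempt counts toward retrans).

```python
def major_timeout_ds(timeo_ds, retrans, cap_ds=600):
    """Return the assumed major timeout in deciseconds.

    Assumes each successive wait doubles, starting at timeo_ds,
    with every individual wait capped at cap_ds (60 s by default).
    """
    total = 0
    wait = timeo_ds
    for _ in range(retrans):
        total += min(wait, cap_ds)  # apply the per-wait ceiling
        wait *= 2                   # assumed exponential backoff
    return total


# Our mount: timeo=14, retrans presumably 3 -> 14 + 28 + 56 deciseconds
print(major_timeout_ds(14, 3) / 10, "seconds")
```

Working in integer deciseconds keeps the sum exact; it is converted to seconds only for display.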
What about using a detachable filesystem like Coda, or a spare NFS server with automatic failover?
-- Tim Starling