[Labs-l] Postmortem on the NFS issues

Marc A. Pelletier marc at uberbox.org
Mon Jul 1 19:49:12 UTC 2013


Bleh.

So, it turns out that in all likelihood, last night's reboot of the NFS
server caused some corruption in the underlying filesystem, which
quickly started to degrade after 12:00 UTC to the point of unusability.

Attempts to repair the filesystem did not succeed and, in order to
restore service, I've copied its contents from a slightly older
snapshot (~1.5h) to a new filesystem and substituted it for the previous
one.  That maneuver restored functionality but may require a restart of
the instances using NFS (the tools project has already been restarted
for that purpose).

We do not yet know exactly what caused the initial corruption, but the
broken filesystem and its snapshots have been kept so that I can
investigate it.  In the meantime, the NFS server has been switched to
ext4-backed storage so that if the issue lies in the interaction between
XFS (the previous filesystem) and block storage, it should not recur.

-- Marc


