[Labs-l] NFS server outage today

Wed Aug 14 17:07:02 UTC 2013

Hello all,

To diagnose the underlying cause of the NFS issues we've been having, I
will be copying the shared storage to a non-thin-provisioned filesystem
and rolling back the NFS server kernel to a version known to work
properly with the controller hardware (i.e., the same one used in
production on identical hardware).

What this means in practice is that there will be a short outage of NFS
service (~30 minutes) during the switch, after which the filesystem will
return without the timetravel snapshot features (which are the reason
we were using the newer kernel).

Annoyingly, due to some technical constraints with NFS, this probably
means that instances that have NFS filesystems mounted will have to be
rebooted after the switch (as the FSID will change).  If your instance
gives you "stale NFS handle" errors after the switch, this is what
happened, and a reboot will fix it.
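
If you would rather check for the condition than wait to trip over it,
here is a minimal sketch, assuming a Linux instance with Python
available.  The helper name is_stale_nfs_handle is made up for
illustration; it relies only on the fact that the kernel reports a stale
handle as the ESTALE error code.

```python
import errno
import os


def is_stale_nfs_handle(path, stat=os.stat):
    """Return True if accessing `path` fails with ESTALE.

    `stat` is injectable only so the check can be exercised without a
    real stale mount; by default it is os.stat.
    """
    try:
        stat(path)
    except OSError as exc:
        # ESTALE is the errno behind "stale NFS handle" messages.
        return exc.errno == errno.ESTALE
    return False
```

Call it with the path of one of your NFS mounts after the switch; a True
result means the instance needs the reboot described above.  Any other
error (for example, a missing path) is reported as False, since it is
not the stale-handle case.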

If the problem persists with the older kernel and driver, then we have
actual hardware issues and will switch hardware around to solve them
(which will require another outage in the following days).  If the
switch to the older kernel /does/ fix the issue, then we will continue
using that configuration (no snapshots) until the driver regression has
been fixed upstream or by the vendor.

I am planning the outage for 20:00 UTC, provided the copy takes roughly
the estimated amount of time.  If the current stalls slow things down
and I need to push it back, I'll send another update to the mailing
list.

-- Marc