[Labs-l] Filesystem downtime to schedule

Wed Dec 31 17:11:38 UTC 2014

Hello Labs,

Many of you may recall that, until some point in late 2013, the labs 
file server provided time travel snapshots (you could see a consistent 
view of the filesystem as it existed 1h, 2h, 3h, 1d, 2d, 3d and 1 week 
ago).

This was disabled at that time - despite being generally considered 
valuable - because it was suspected to be (part of) the cause of the 
stability problems the NFS server suffered at the time.  This turned out 
not to be the case, and we could turn it back on now.

Indeed, doing so is a prerequisite to the planned replication of the 
filesystem in the new datacenter where a redundant Labs installation is 
slated to be deployed[1].

The issue is that turning that feature back on requires changing the way 
the disk space is currently allocated at a low level[2] and necessitates 
a fairly long period of partial downtime during which data is being 
copied from one part of the disk subsystem to the other.  In practice, 
this would require the primary partitions (/home and /data/project) to 
be set readonly for a period on the order of a day (24-30 hours).

That downtime is pretty much unavoidable eventually as it is a 
requirement of expanding labs and improving data resilience and 
reliability, but the /timing/ of that is flexible.  I wanted to "poll" 
labs users as to when the possibility of disruption is minimized, and 
give everyone plenty of time to make contingency plans and/or notify 
their end users of the expected period of reduced availability.

Provided there is a good consensus that the week is a better time than 
the weekend (I am guessing here that volunteer coders and users are more 
active during the weekend) then I would suggest starting the operation 
on Tuesday, January 13 at 18:00 UTC.  The period of downtime is expected 
to last until January 14, 18:00 UTC but may extend a few hours beyond that.

The expected impacts are:

* Starting at the beginning of the window, /home and /data/project will 
switch to readonly mode; any attempt to write to files in those trees 
will result in EROFS errors being thrown.  Reading from those 
filesystems will still work as expected, as will writing to other 
filesystems;
* Read performance may degrade noticeably as the disk subsystem will be 
loaded to capacity;
* It will not be possible to manipulate the gridengine queue - 
specifically, starting or stopping jobs will not work; and
* At the end of the window, when the operation is complete, the "old" 
file system will go away and be replaced by the new one - this will 
cause any access to files or directories that were previously opened 
(including working directories) on the affected filesystems to error out 
with ESTALE.  Reopening files by name will access the new copy, which is 
identical to the file as it was when the filesystems became readonly.

In practice, that last impact means that most running programs will be 
unable to continue unless they have special handling for this situation, 
and most gridengine jobs will no longer be able to log output.  It may 
be a good idea to restart any continuous tool once the window ends.  All 
webservices that were running at the start of the maintenance window 
will be restarted at that time.
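
For tools that would rather ride out the window than be restarted, the 
EROFS and ESTALE errors described above can be caught explicitly.  The 
following is only a rough sketch in Python, not something the 
maintenance itself provides: the helper names (try_append, with_reopen) 
are made up for illustration, and it assumes your tool can tolerate 
skipping a write and can reopen its files by name.

```python
import errno


def try_append(path, text):
    """Append text to path; return False if the filesystem is read-only."""
    try:
        with open(path, "a") as handle:
            handle.write(text)
        return True
    except OSError as exc:
        if exc.errno == errno.EROFS:
            # Read-only window: skip the write instead of crashing.
            return False
        raise


def _close_quietly(handle):
    """Close a handle, ignoring errors from an already-stale file."""
    try:
        handle.close()
    except OSError:
        pass


def with_reopen(opener, action, retries=1):
    """Run action(handle); on ESTALE, reopen via opener() and try again."""
    handle = opener()
    try:
        for attempt in range(retries + 1):
            try:
                return action(handle)
            except OSError as exc:
                if exc.errno != errno.ESTALE or attempt == retries:
                    raise
                _close_quietly(handle)
                handle = opener()  # reopening by name reaches the new copy
    finally:
        _close_quietly(handle)
```

For example, with_reopen(lambda: open(path), lambda fh: fh.read()) would 
retry a read once after the switch to the new filesystem.  Note that a 
reopened file starts again at offset zero, and that a stale working 
directory is not covered by this sketch.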

If you have tools or other processes running that do not rely on being 
able to write to /data/project, they may be able to continue running 
during the downtime without interruption.  Jobs that only access the 
network (for instance, the MediaWiki API) or the databases are unlikely 
to be affected.  Because of this, no automatic or forcible restart of 
running (non-webservice) jobs will be made.

In particular, if you have a tool whose continued operation is 
important, temporarily modifying it so that it works from /data/scratch 
may be a good workaround.
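
One way to make that switch without editing paths by hand is to probe 
for a writable directory at startup.  This is a minimal sketch in 
Python; the function name and the example directories are hypothetical, 
and it assumes the tool only needs some writable location rather than 
its usual one.

```python
import tempfile


def first_writable_dir(candidates):
    """Return the first directory in candidates that accepts a new file."""
    for directory in candidates:
        try:
            # Creating (and auto-deleting) a temp file is the real test;
            # it fails with an OSError on a read-only or missing directory.
            with tempfile.TemporaryFile(dir=directory):
                return directory
        except OSError:
            continue
    return None
```

A tool could then call something like 
first_writable_dir(["/data/project/mytool", "/data/scratch/mytool"]) 
and write its output wherever that points ("mytool" being a placeholder 
for your own directory, which must already exist).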

Finally, to avoid the risk of the filesystem move taking longer than 
expected and increasing downtime significantly, LOG FILES OVER 1G 
WILL NOT BE COPIED.  If you have critical files that are not simple 
log files but whose names end in .log, .err or .out then you MUST 
compress those files if you absolutely require them to survive the 
transition.  Alternately, truncating them to some size comfortably 
smaller than 1G will work if the file must remain uncompressed.
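
To check ahead of time whether any of your files fall under that rule, 
something along these lines may help.  It is only a sketch in Python: 
the function names are invented here, "1G" is assumed to mean 2**30 
bytes, and the exact matching used by the copy process may differ.

```python
import gzip
import os
import shutil

LOG_SUFFIXES = (".log", ".err", ".out")
ONE_GIB = 1 << 30  # assumption: "1G" means 2**30 bytes


def find_large_logs(root, limit=ONE_GIB):
    """Yield paths under root with a log suffix and a size above limit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(LOG_SUFFIXES):
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) > limit:
                    yield path
            except OSError:
                continue  # file vanished or is unreadable; skip it


def compress_in_place(path):
    """Gzip path to path + '.gz', then remove the original."""
    target = path + ".gz"
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return target
```

Running find_large_logs over your home and project directories lists 
the candidates; compress_in_place can then be applied to the ones you 
need to keep.  This has to happen before the window starts, since the 
trees are readonly afterwards, and compressing temporarily needs room 
for both copies.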

The speed and reliability of the maintenance process depend on the 
total amount of data to copy.  If you can clean up both your home and 
project directories of extraneous files, you'll help the process 
greatly.  :-)

Thanks all,

-- Marc