[QA] Fwd: [Labs-l] Filesystem downtime to schedule

Bryan Davis bd808 at wikimedia.org
Wed Dec 31 17:48:34 UTC 2014

I think this will mess some things up in beta (file uploads, debug
logging). If having a partial outage of beta from just after a group0
deploy to just before a new branch release sounds bad to any of you,
you may want to respond to Coren's thread on labs-l and suggest a
better time (Thurs-Fri?).


---------- Forwarded message ----------
From: Marc A. Pelletier <marc at uberbox.org>
Date: Wed, Dec 31, 2014 at 10:11 AM
Subject: [Labs-l] Filesystem downtime to schedule
To: Wikimedia Labs <labs-l at lists.wikimedia.org>

Hello Labs,

Many of you may recall that until some point in late 2013, one of the
features of the labs file server was that it provided time-travel
snapshots (you could see a consistent view of the filesystem as it
existed 1h, 2h, 3h, 1d, 2d, 3d and 1 week ago).

This was disabled at that time - despite being generally considered
valuable - because it was suspected to be (part of) the cause of the
stability problems the NFS server suffered at the time.  This turned
out not to be the case, and we could turn it back on now.

Indeed, doing so is a prerequisite to the planned replication of the
filesystem in the new datacenter where a redundant Labs installation
is slated to be deployed[1].

The issue is that turning that feature back on requires changing the
way the disk space is currently allocated at a low level[2] and
necessitates a fairly long period of partial downtime during which
data is being copied from one part of the disk subsystem to the other.
In practice, this would require the primary partitions (/home and
/data/project) to be set readonly for a period on the order of a day
(24-30 hours).

That downtime is pretty much unavoidable eventually, as it is a
requirement for expanding labs and improving data resilience and
reliability, but the /timing/ of it is flexible.  I wanted to "poll"
labs users as to when the disruption would be least damaging, and
give everyone plenty of time to make contingency plans and/or
notify their end users of the expected period of reduced availability.

Provided there is a good consensus that the week is a better time than
the weekend (I am guessing here that volunteer coders and users are
more active during the weekend) then I would suggest starting the
operation on Tuesday, January 13 at 18:00 UTC.  The period of downtime
is expected to last until January 14, 18:00 UTC but may extend a few
hours beyond that.

The expected impacts are:

* Starting at the beginning of the window, /home and /data/project
will switch to readonly mode; any attempt to write to files in those
trees will result in EROFS errors being thrown.  Reading from those
filesystems will still work as expected, as will writing to other
filesystems (such as /data/scratch);
* Read performance may degrade noticeably as the disk subsystem will be
loaded to capacity;
* It will not be possible to manipulate the gridengine queue -
specifically, starting or stopping jobs will not work; and
* At the end of the window, when the operation is complete, the "old"
file system will go away and be replaced by the new one - this will
cause any access to files or directories that were previously opened
(including working directories) on the affected filesystems to error
out with ESTALE.  Reopening files by name will access the new copy,
which is identical to the state at the time the filesystems became
readonly.

In practice, that latter impact has the effect that most running
programs will be unable to continue unless they have special handling
for this situation, and most gridengine jobs will no longer be able to
log output.  It may be a good idea to restart any continuous tool at
that time.  All webservices that were running at the start of the
maintenance window will be restarted at that time.
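For tools that must keep logging across the swap, that special handling
could look something like the sketch below: open the file by name on
every write (so a stale handle is never reused), and back off and retry
when the kernel reports a stale handle or a read-only filesystem.  The
function name and retry policy are illustrative, not anything Labs
provides.

```python
import errno
import os
import time

def append_line(path, line, retries=3):
    """Append a line, reopening the file by name on each attempt.

    Illustrative sketch only: retries on ESTALE (stale NFS handle after
    the filesystem swap) and EROFS (the readonly window), with a simple
    exponential backoff.
    """
    for attempt in range(retries):
        try:
            # Opening by name each time avoids holding a handle that can
            # go stale when the old filesystem is replaced.
            with open(path, "a") as fh:
                fh.write(line + "\n")
            return True
        except OSError as err:
            if err.errno in (errno.ESTALE, errno.EROFS):
                time.sleep(2 ** attempt)
                continue
            raise
    return False
```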

If you have tools or other processes running that do not rely on being
able to write to /data/project, they may be able to continue running
during the downtime without interruption.  Jobs that only access the
network (for instance, the Mediawiki API) or the databases will not
likely be affected.  Because of this, no automatic or forcible restart
of running (non-webservice) jobs will be made.

In particular, if you have a tool whose continued operation is
important, temporarily modifying it so that it works from
/data/scratch may be a good workaround.
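As a rough illustration of that workaround, a tool could probe its
usual output directory and fall back to /data/scratch when the
preferred location cannot be written (read-only, missing, or
permission-denied).  Both paths and the helper name here are
placeholders, not an interface Labs offers.

```python
import errno
import os

def writable_dir(preferred, fallback):
    """Return preferred if a test write succeeds there, else fallback.

    Hypothetical sketch: falls back on EROFS (readonly window), ENOENT
    (missing directory) and EACCES (permission denied); anything else
    is re-raised.
    """
    probe = os.path.join(preferred, ".write-probe")
    try:
        with open(probe, "w") as fh:
            fh.write("ok")
        os.remove(probe)
        return preferred
    except OSError as err:
        if err.errno in (errno.EROFS, errno.ENOENT, errno.EACCES):
            return fallback
        raise
```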

Finally, in order to avoid the risk of the filesystem move taking
longer than expected and increasing downtime significantly, LOG FILES
OVER 1G WILL NOT BE COPIED.  If you have critical files that are not
simple log files but whose names end in .log, .err or .out, then you MUST
compress those files if you absolutely require them to survive the
transition.  Alternately, truncating them to some size comfortably
smaller than 1G will work if the file must remain uncompressed.
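One possible way to do that compression in bulk is sketched below.  The
1G limit and the .log/.err/.out extensions come from this announcement;
the directory walk and the function name are my own illustration.

```python
import gzip
import os
import shutil

def compress_large_logs(root, limit=1 << 30):
    """Gzip *.log/*.err/*.out files larger than limit bytes under root.

    Sketch only: replaces each oversized file with a .gz copy so it
    stays under the copy threshold.  Returns the paths it compressed.
    """
    compressed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith((".log", ".err", ".out")):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) <= limit:
                continue
            with open(path, "rb") as src, \
                 gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)  # keep only the compressed copy
            compressed.append(path)
    return compressed
```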

The speed and reliability of the maintenance process depends on the
total data to copy.  If you can clean up both your home and project
directories of extraneous files, you'll help the process greatly.  :-)

Thanks all,

-- Marc

Labs-l mailing list
Labs-l at lists.wikimedia.org

Bryan Davis              Wikimedia Foundation    <bd808 at wikimedia.org>
[[m:User:BDavis_(WMF)]]  Sr Software Engineer            Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855
