[QA] Fwd: [Labs-l] Filesystem downtime to schedule

Chris McMahon cmcmahon at wikimedia.org
Wed Dec 31 18:05:14 UTC 2014


On Wed, Dec 31, 2014 at 10:48 AM, Bryan Davis <bd808 at wikimedia.org> wrote:

> I think this will mess some things up in beta (file uploads, debug
> logging). If having a partial outage of beta from just after a group0
> deploy to just before a new branch release sounds bad to any of you,
> you may want to respond to Coren's thread on labs-l and suggest a
> better time (Thurs-Fri?).
>

I think Thu/Fri would definitely be preferable. People do a ton of
pre-deploy checking on beta labs early in the week. VE for sure, and
probably MobileFrontend, would be particularly affected.




>
> Bryan
>
>
> ---------- Forwarded message ----------
> From: Marc A. Pelletier <marc at uberbox.org>
> Date: Wed, Dec 31, 2014 at 10:11 AM
> Subject: [Labs-l] Filesystem downtime to schedule
> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>
>
> Hello Labs,
>
> Many of you may recall that until some point in late 2013, one of the
> features of the labs file server was that it provided time-travel
> snapshots (you could see a consistent view of the filesystem as it
> existed 1h, 2h, 3h, 1d, 2d, 3d and 1 week ago).
>
> This was disabled at that time - despite being generally considered
> valuable - because it was suspected to be (part of) the cause of the
> stability problems the NFS server suffered at the time.  This turned
> out not to have been the case, and we could turn it back on now.
>
> Indeed, doing so is a prerequisite to the planned replication of the
> filesystem in the new datacenter where a redundant Labs installation
> is slated to be deployed[1].
>
> The issue is that turning that feature back on requires changing the
> way the disk space is currently allocated at a low level[2] and
> necessitates a fairly long period of partial downtime during which
> data is being copied from one part of the disk subsystem to the other.
> In practice, this would require the primary partitions (/home and
> /data/project) to be set readonly for a period on the order of a day
> (24-30 hours).
>
> That downtime is pretty much unavoidable eventually, as it is a
> requirement of expanding labs and improving data resilience and
> reliability, but the /timing/ of it is flexible.  I wanted to "poll"
> labs users as to when the disruption would be smallest, and to
> give everyone plenty of time to do contingency planning and/or
> notify their end users of the expected period of reduced availability.
>
> Provided there is a good consensus that a weekday is a better time than
> the weekend (I am guessing here that volunteer coders and users are
> more active during the weekend), I would suggest starting the
> operation on Tuesday, January 13 at 18:00 UTC.  The period of downtime
> is expected to last until January 14 at 18:00 UTC, but may extend a few
> hours beyond that.
>
> The expected impacts are:
>
> * Starting at the beginning of the window, /home and /data/project
> will switch to readonly mode; any attempt to write to files in those
> trees will result in EROFS errors.  Reading from those
> filesystems will still work as expected, as will writing to other
> filesystems;
> * Read performance may degrade noticeably as the disk subsystem will be
> loaded to capacity;
> * It will not be possible to manipulate the gridengine queue -
> specifically, starting or stopping jobs will not work; and
> * At the end of the window, when the operation is complete, the "old"
> filesystem will go away and be replaced by the new one - this will
> cause any access to files or directories that were previously opened
> (including working directories) on the affected filesystems to error
> out with ESTALE.  Reopening files by name will access the new copy,
> which is identical to the state at the time the filesystems became readonly.
>
> In practice, that latter impact means that most running
> programs will be unable to continue unless they have special handling
> for this situation, and most gridengine jobs will no longer be able to
> log output.  It may be a good idea to restart any continuous tool at
> that point.  All webservices that were running at the start of the
> maintenance window will be restarted at that time.
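>
> As an illustrative sketch only (the tool name and log path below are
> made up), a Python tool that must keep writing could reopen its log by
> name on every write instead of holding a long-lived handle, and treat
> the read-only window as a soft failure:
>
>     import errno
>
>     # Hypothetical log path; an absolute path avoids a stale working directory.
>     LOG_PATH = '/data/project/mytool/mytool.log'
>
>     def append_line(line):
>         # Append one line, tolerating the read-only maintenance window.
>         try:
>             # Opening by name on every write means no long-lived handle can
>             # go stale (ESTALE) when the old filesystem is swapped out.
>             with open(LOG_PATH, 'a') as f:
>                 f.write(line + '\n')
>             return True
>         except (IOError, OSError) as e:
>             if e.errno in (errno.EROFS, errno.ESTALE):
>                 # EROFS: filesystem is read-only during the window; drop or buffer.
>                 # ESTALE: the old handle or directory vanished; the next call,
>                 # which opens by name again, will reach the new copy.
>                 return False
>             raise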
>
> If you have tools or other processes running that do not rely on being
> able to write to /data/project, they may be able to continue running
> during the downtime without interruption.  Jobs that only access the
> network (for instance, the MediaWiki API) or the databases are not
> likely to be affected.  Because of this, no automatic or forcible restart
> of running (non-webservice) jobs will be made.
>
> In particular, if you have a tool whose continued operation is
> important, temporarily modifying it so that it works from
> /data/scratch may be a good workaround.
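>
> As a sketch of that workaround (the variable and directory names are
> hypothetical), a tool can read its writable directory from the
> environment, so it can be pointed at /data/scratch for the duration and
> switched back afterwards without a code change:
>
>     import os
>
>     # Hypothetical: default to the project directory, but allow an override
>     # so the tool can write under /data/scratch during the read-only window.
>     DATA_DIR = os.environ.get('MYTOOL_DATA_DIR', '/data/project/mytool')
>
>     def output_path(name):
>         # Return a path inside the currently configured writable directory.
>         if not os.path.isdir(DATA_DIR):
>             os.makedirs(DATA_DIR)
>         return os.path.join(DATA_DIR, name)
>
>     # During the window, start the tool with e.g.
>     #   MYTOOL_DATA_DIR=/data/scratch/mytool python mytool.py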
>
> Finally, in order to avoid the risk of the filesystem move taking longer
> than expected and increasing downtime significantly, LOG FILES OVER 1G
> WILL NOT BE COPIED.  If you have critical files that are not simple
> log files but whose names end in .log, .err or .out, then you MUST
> compress those files if you absolutely require them to survive the
> transition.  Alternatively, truncating them to some size comfortably
> smaller than 1G will work if a file must remain uncompressed.
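>
> As a rough Python sketch of that cleanup (the directory and the 1G
> threshold are placeholders to adjust for your own tool), compress each
> oversized .log/.err/.out file and then truncate the original:
>
>     import glob
>     import gzip
>     import os
>     import shutil
>
>     ONE_GIB = 1024 ** 3
>     TOOL_DIR = '/data/project/mytool'   # hypothetical tool directory
>
>     candidates = []
>     for pattern in ('*.log', '*.err', '*.out'):
>         candidates.extend(glob.glob(os.path.join(TOOL_DIR, pattern)))
>
>     for path in candidates:
>         if os.path.getsize(path) < ONE_GIB:
>             continue
>         # Compress the file so a copy survives the transition ...
>         with open(path, 'rb') as src, gzip.open(path + '.gz', 'wb') as dst:
>             shutil.copyfileobj(src, dst)
>         # ... then truncate the original to keep it well under 1G.
>         open(path, 'w').close()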
>
> The speed and reliability of the maintenance process depend on the
> total amount of data to copy.  If you can clear extraneous files out of
> both your home and project directories, you'll help the process greatly.  :-)
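>
> If it helps with that cleanup, here is a small Python sketch (the
> starting directory is just an example) that lists the largest files
> under a tree so you can see what is worth deleting or compressing:
>
>     import os
>
>     def largest_files(root, top=20):
>         # Walk the tree and return (size, path) pairs, biggest first.
>         sizes = []
>         for dirpath, _dirnames, filenames in os.walk(root):
>             for name in filenames:
>                 path = os.path.join(dirpath, name)
>                 try:
>                     sizes.append((os.path.getsize(path), path))
>                 except OSError:
>                     pass   # broken symlinks, files removed mid-walk, etc.
>         return sorted(sizes, reverse=True)[:top]
>
>     for size, path in largest_files(os.path.expanduser('~')):
>         print('%12d  %s' % (size, path))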
>
> Thanks all,
>
> -- Marc
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>
> --
> Bryan Davis              Wikimedia Foundation    <bd808 at wikimedia.org>
> [[m:User:BDavis_(WMF)]]  Sr Software Engineer            Boise, ID USA
> irc: bd808                                        v:415.839.6885 x6855
>
> _______________________________________________
> QA mailing list
> QA at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/qa
>