[Labs-l] More glusterfs woes

Thu Feb 14 23:21:17 UTC 2013

On Thu, Feb 14, 2013 at 11:20 AM, Ryan Lane <rlane at wikimedia.org> wrote:

> The glusterd process went into a death spiral at around midnight UTC last
> night. The glusterfs/glusterfsd processes continued to work fine, which
> allowed the filesystem to continue to work properly, but all four servers
> were approaching swap death.
>
> This version of gluster also has issues with the upstart scripts. It won't
> properly start/stop the gluster services. I'm having to reboot the hosts.
> I'm going to track down this issue today in a labs instance. For the next
> few hours some projects will have issues accessing project and/or home
> directories.
>
> This will not affect services using instance storage (/mnt).
>
>
Volumes are being force restarted right now. Login should work on all nodes,
and project storage should currently work fine for most projects. All
volumes should be completely up in about an hour.

The glusterfs folks can't reproduce the upstart issue we're seeing in our
cluster. As a workaround for now, I've replaced the upstart jobs with init
scripts, which behave exactly as expected. In the future, it should be
possible to work around the outage condition we had today without a
prolonged volume force-start state.

- Ryan