<div dir="ltr">On Wed, Mar 6, 2013 at 8:54 AM, Petr Bena <span dir="ltr"><<a href="mailto:benapetr@gmail.com" target="_blank">benapetr@gmail.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">okay this is third time when we have same outage... bastion2 and 3<br>
were accessible for short time after bastion1's gluster died, then<br>
they died as well. public keys weren't accessible on any of them so<br>
basically labs were inaccessible for anyone.<br>

Ok, I tracked this down some. glusterd became unstable on all of the labstore nodes; it was crashing and restarting pretty often. The glusterfs service (which runs NFS) crashes along with glusterd. The glusterfsd processes (which run the gluster filesystems) are decoupled from the glusterd process, so they continue running without issue.

I just restarted all of the glusterd processes. That caused an NFS outage, which could only be fixed by killing all of the glusterfs processes and restarting the glusterd processes again. This triggered the issue we're seeing with bastion1: it looks like long NFS timeouts on lucid make SSH inaccessible forever. The precise instances recover from this properly.

I'm going to rebuild bastion1 as precise (saving the SSH keys, of course) to work around this issue.

- Ryan