<div dir="ltr">I think this lessened considerably after <a href="https://phabricator.wikimedia.org/T161898#3197879">https://phabricator.wikimedia.org/T161898#3197879</a> but it did still happen a few times in the last 24 hours (then again, we have had for some time an underlying low level of transient Puppet issues). I think we'll know in a few days whether this was a light 24h by coincidence or whether we are chipping away here.<div><br></div><div>I made some changes to get better insight into what exactly is happening when Puppet is vomiting, and to get better consistency: <a href="https://gerrit.wikimedia.org/r/#/c/349433/">https://gerrit.wikimedia.org/r/#/c/349433/</a></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Apr 20, 2017 at 8:53 AM, Chase Pettet <span dir="ltr"><<a href="mailto:cpettet@wikimedia.org" target="_blank">cpettet@wikimedia.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
<br>
So if I'm reading nfs-mount-manager correctly, it sounds like<br>
`/usr/bin/timeout -k 10s 20s ls $2` is failing in the<br>
`nfs-mount-manager check` for the Exec resource and then the Exec<br>
tries to mount the export again? Would that mount command ever<br>
actually work on an active host?<br><span class="m_-7866274986592557248gmail-m_-7072142663675906658HOEnZb"></span></blockquote><div><br></div><div><br></div></span><div>This is all a game of <a href="https://i.ytimg.com/vi/Pk1ue1tolFc/maxresdefault.jpg" target="_blank">mousetrap</a> where we hope the end result is either a Puppet run that fails in a way that doesn't screw things up further when NFS is unavailable (for a variety of reasons), or one that ends in successfully ensuring the mounts are ready to go. We want the trap to spring in the missing-mount case(s), the unhealthy-mount cases, and the absent-mount cases.</div><div><br></div><div>So the working theory is: when the check fails, whether because the export is not mounted at all or because it is mounted but so unhealthy that it appears unmounted, we try to mount. The idea is that the mount may not succeed in the unhealthy case, but it will succeed in the not-yet-mounted case and fail as expected in the absent case. We could create two 'check'-like case statements, one for health and one for mount status, but historically trying to handle them separately produced far more edge cases than considering them together. So I think the remount question is answered with: check has two conditions that can fail, and the declarative Puppet idiom is to try to mount whenever it comes back failing, no matter which condition failed, because mount itself is a safe operation. Safe in the sense that it will fail sanely if something happens to be mounted at the time it tries (and that's OK, because to get there the mount already failed to show up as healthy, and we are just carrying that failure forward). That is probably more opinion on how or if things should surface than anything else. This is all a huge mess, and I think most of nfs_mount.pp should be rewritten as a custom function, for our own sanity and future debugging. 
Madhu and I talked about this previously, but the varied conditions to fail safely under, plus the recovery paths, take ages to run through the playbook test conditions, and there hasn't been time.</div><div><br></div><div>So, things I thought of while looking here:</div><div><br></div><div>* rewrite nfs-mount-manager in python</div><div>* break out nfs_mount into a custom Puppet function with more logging and debug trappings</div><div>* consider breaking up check for nfs-mount-manager into 'health' and 'status' (for mount)</div><div>* update the timeout settings in nfs-mount-manager to match <a href="https://gerrit.wikimedia.org/r/#/c/348788/" target="_blank">https://gerrit.wikimedia.org/r/#/c/348788/</a></div><div><br></div><div>That all said, this isn't the core problem per se, as we have another very vanilla "Can NFS service me?" safety check before doing some grid things, and it is intermittently the failing component as well.</div><div><br></div><div>In modules/toollabs/manifests/init.pp:</div><div><br></div><div><div> exec {'ensure-grid-is-on-NFS':</div><div>   command => '/bin/false',</div><div>   unless  => "/usr/bin/timeout -k 5s 30s /usr/bin/test -e ${project_path}/herald",</div><div> }</div></div><div><br></div><div>(The command is /bin/false, so the resource fails loudly whenever the unless check can't see ${project_path}/herald within the timeout.)</div><div><br></div><div>This is the failure in about half the spot checks I've done, and nfs-mount-manager check is the other half. I put this second one in a while ago because, in some cases of NFS unavailability, we were going ahead with the whole insane resource collection madness on the (then local) disk and doing grid setup things that thought this was a node in a wacky state. 
But anyway, when this fails it is a dead-simple "I can't see a thing on NFS".</div><div><br></div><div>I caught the tool in this comment today, and it /definitely/ was causing these failures, although that doesn't mean it's the only thing doing it:</div><div><br></div><div><a href="https://phabricator.wikimedia.org/T161898#3197517" target="_blank">https://phabricator.wikimedia.org/T161898#3197517</a><br></div><div><br></div><div><br></div><div><br></div></div><br><br clear="all"><div><br></div>
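To make the check-then-mount idiom concrete (and in the spirit of the "rewrite nfs-mount-manager in python" idea above), here is a minimal sketch. This is not the actual script; the function names and timeout values are illustrative, mirroring the `/usr/bin/timeout ... ls` check described earlier:

```python
import subprocess

def check_mount(path, timeout=20):
    """Return True if listing the path completes and succeeds within the
    timeout. The check can fail for two reasons: nothing is mounted at
    the path, or something is mounted but unhealthy (e.g. a stale NFS
    handle), in which case even `ls` hangs -- the timeout bounds that hang."""
    try:
        result = subprocess.run(
            ["ls", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def ensure_mount(path):
    """Try to mount whenever the check fails, no matter which condition
    failed. mount(8) is the safe recovery step: it succeeds in the
    not-yet-mounted case, fails sanely if the path is already mounted,
    and fails as expected for an absent export."""
    if check_mount(path):
        return True
    return subprocess.run(["mount", path]).returncode == 0
```

The point of the single `ensure_mount` entry path is exactly the one argued above: both failure conditions funnel into one safe, retryable recovery action.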
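And as a sketch of the "break up check into 'health' and 'status'" idea from the list above (hypothetical names; assumes Linux's /proc/mounts): 'status' asks only "is something mounted here?" without touching the possibly hung NFS server, while 'health' asks "does the mount actually respond?" with a bounded `ls`:

```python
import subprocess

def mount_status(path):
    """Is anything mounted at this path? Reads /proc/mounts only, where
    the mount point is the second whitespace-separated field, so it never
    generates NFS traffic and cannot hang."""
    with open("/proc/mounts") as mounts:
        return any(line.split()[1] == path for line in mounts)

def mount_health(path, timeout=20):
    """Does the mount actually respond? A time-bounded directory listing
    that returns False instead of blocking when the server is hung."""
    try:
        result = subprocess.run(
            ["ls", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Keeping them separate would let logging distinguish "mounted but sick" from "not mounted at all", though, as noted above, acting on them separately has historically produced more edge cases than treating any failure as "try to mount".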
</div></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>Chase Pettet</div></div><div>chasemp on <a href="https://phabricator.wikimedia.org/p/chasemp/" target="_blank">phabricator</a> and IRC<br></div></div></div></div></div></div></div></div>
</div>