[Labs-admin] ** PROBLEM alert - tools-exec-1433/Puppet run is CRITICAL **

Chase Pettet cpettet at wikimedia.org
Thu Apr 20 13:53:47 UTC 2017


>
>
>
> So if I'm reading nfs-mount-manager correctly, it sounds like
> `/usr/bin/timeout -k 10s 20s ls $2` is failing in the
> `nfs-mount-manager check` for the Exec resource and then the Exec
> tries to mount the export again? Would that mount command ever
> actually work on an active host?
>


This is all a game of mousetrap
<https://i.ytimg.com/vi/Pk1ue1tolFc/maxresdefault.jpg>  where we hope that
the end result is either a Puppet run that fails in a way that doesn't screw
things up further when NFS is unavailable (for a variety of reasons), or one
that ends in successfully ensuring the mounts are ready to go.  We want the
trap to catch the mouse in the missing mount case(s), the unhealthy mount
cases, and the absent mount cases.

So the working theory there is: when the check fails, either because
nothing is mounted or because the mount is so unhealthy that it appears
unmounted, we try to mount.  The idea being that the mount may not succeed
in the unhealthy case, but it will succeed in the new (not-yet-mounted)
case and fail as expected in the absent case.  We could create two 'check'
like case statements, one for health and one for mount status, but
historically trying to handle them separately ended up with far more edge
cases than considering them together.  So I think the remount question is
answered with: check has two conditions that can fail, and the declarative
Puppet idiom is to try to mount whenever it comes back failing, no matter
what, because mount itself is a safe operation.  Safe in the sense that it
will fail sanely if something happens to be mounted at the time it tries
(and that's OK, because to get there it already failed to show up as
healthy, and we are just carrying that forward).  That is probably more
opinion on how or whether things should surface than anything else.  This
is all a huge mess, and I think most of nfs_mount.pp should be rewritten
as a custom function, for our own sanity and future debugging.  Madhu and
I talked about this previously, but the varied conditions we need to fail
safely under, along with recovery, take ages to run through the playbook
test conditions and there hasn't been time.
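
To make that idiom concrete, here is a rough sketch of what the
check-then-mount logic amounts to, written as the sort of Python a
rewritten nfs-mount-manager might contain.  This is illustrative only:
the function names are made up, the timeouts are just the current ones
from the shell script, and the real mount handling lives in nfs_mount.pp.

    # Illustrative only -- not the real nfs-mount-manager.  Shows the
    # "check can fail two ways, mount anyway" idiom described above.
    import subprocess

    def check(mount_point):
        # Fails both when nothing is mounted and when the mount is so
        # unhealthy (e.g. a hung NFS server) that a bounded ls times out.
        return subprocess.run(
            ['/usr/bin/timeout', '-k', '10s', '20s', 'ls', mount_point],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    def ensure_mounted(mount_point):
        # Try to mount whenever check() fails, regardless of why it failed.
        # mount(8) fails sanely if something is already mounted there, so
        # the worst case is a failed Puppet resource that surfaces the
        # problem instead of hiding it.
        # (Assumes an /etc/fstab entry exists for mount_point.)
        if check(mount_point):
            return True
        return subprocess.run(['/bin/mount', mount_point]).returncode == 0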

So, things I thought of while looking at this:

* rewrite nfs-mount-manager in python
* break out nfs_mount into a custom Puppet function with more logging and
debug trappings
* consider breaking up check for nfs-mount-manager into 'health' and
'status' (for mount), as sketched below
* update the timeout settings in nfs-mount-manager to match
https://gerrit.wikimedia.org/r/#/c/348788/
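
On that third point, the split might look roughly like the following
(hypothetical names, not existing code): 'status' only consults the mount
table, while 'health' keeps the bounded probe, so a mounted-but-hung
export would fail health without failing status.

    # Hypothetical sketch of splitting 'check' into 'status' and 'health'.
    import subprocess

    def status(mount_point):
        # Mount state only: is anything mounted at this path?
        with open('/proc/mounts') as mounts:
            return any(line.split()[1] == mount_point for line in mounts)

    def health(mount_point):
        # Responsiveness: does a bounded ls of the path come back in time?
        return subprocess.run(
            ['/usr/bin/timeout', '-k', '10s', '20s', 'ls', mount_point],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

Whether that separation actually avoids the old edge cases would still
need the playbook run-through mentioned above.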

That all said, this isn't the core problem per se, as we have another very
vanilla "Can NFS service me?" safety check before doing some grid things,
and that check is intermittently the failing component as well.

In modules/toollabs/manifests/init.pp

    exec {'ensure-grid-is-on-NFS':
        command => '/bin/false',
        unless  => "/usr/bin/timeout -k 5s 30s /usr/bin/test -e ${project_path}/herald",
    }

This is the failing resource in about half of the spot checks I've done,
and nfs-mount-manager check is the other half.  (The command is /bin/false,
which always fails, so the resource only comes back clean when the
timeout-guarded test can actually see ${project_path}/herald on NFS.)  I
put this second check in a while ago because, in some cases of NFS
unavailability, we were going ahead with the whole insane resource
collection madness on the then local disk and doing grid setup things that
thought this was a node in a whacky state.  But anyway, when this fails it
is basically the dead simple "I can't see a thing on NFS".

I caught the tool in this comment today and it /definitely/ was causing
these failures, although that doesn't mean it's the only thing doing it.

https://phabricator.wikimedia.org/T161898#3197517