[Labs-l] Custom nagios checks

Tue Feb 4 08:49:04 UTC 2014

Well, imagine there is a configuration file managed by puppet, and
you discover something wrong in it that must be changed because it
otherwise poses a security risk. You can't change it, because puppet
keeps reverting it and no one on the ops team is awake to merge the
patch.

What other option do you have other than turning puppet off on that instance?

I am talking about some service or software that YOU might have
written or set up and that nobody else understands, so there is no
point in having a review from a third party (for example, clue-bot).

Even if it is technically possible to work like this, I doubt that
anyone would prefer this way of working on labs unless they had to.
That is why I think that if puppet were the only option for
maintaining custom nagios checks, most people would just not use
nagios to check whether their own services are working...
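
To illustrate how small such a custom check can be, here is a minimal
sketch of a nagios-style plugin in Python. Only the exit codes (0 OK,
1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard plugin convention;
the host, port, thresholds and function names are hypothetical and not
taken from any existing labs check.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(connect_seconds, warn=1.0, crit=3.0):
    """Map a connect time (None if the connect failed) to (code, message)."""
    if connect_seconds is None:
        return CRITICAL, "CRITICAL - service is not accepting connections"
    if connect_seconds >= crit:
        return CRITICAL, "CRITICAL - connect took %.2fs" % connect_seconds
    if connect_seconds >= warn:
        return WARNING, "WARNING - connect took %.2fs" % connect_seconds
    return OK, "OK - connect took %.2fs" % connect_seconds


def measure(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.time()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return None
    sock.close()
    return time.time() - start


def main(host="localhost", port=8080):
    # Hypothetical defaults; a real check would take these from its arguments.
    code, message = classify(measure(host, port))
    print(message)
    return code
```

To use it as a plugin, the script would end with
sys.exit(main()) so that nagios or icinga sees the exit code; the
first line of output is what shows up as the status text.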

On Tue, Feb 4, 2014 at 9:30 AM, Ryan Lane <rlane32 at gmail.com> wrote:
> On Tue, Feb 4, 2014 at 12:14 AM, Petr Bena <benapetr at gmail.com> wrote:
>>
>> I think that Ryan said something like he would most happily get rid of
>> puppet or replace it with a better solution :P but if you really want
>> to keep stuff managed by puppet, I still see an issue with other
>> projects which aren't using puppet, or which use a different
>> puppetmaster.
>>
>
> I didn't say that. I said if you're starting from scratch you should
> consider something other than puppet. That wasn't about Labs or Wikimedia at
> all.
>
>>
>> To be honest, from my point of view, puppet as it is now on labs is
>> almost unusable for non-ops users. Getting any simple change merged,
>> unless it's a top-priority thing, requires someone from ops and
>> usually takes at least a few hours, if not days. I can't imagine any
>> sysadmin who can work like this. Some changes need to be applied
>> immediately; you can't wait days for them to happen. So I expect
>> that the vast majority of projects that exist now will not use
>> puppet anyway (you just can't force people to use it under these
>> circumstances), so they wouldn't benefit from this.
>>
>
> You shouldn't be making changes to systems without code review. Wikimedia
> Ops generally has a bad practice in this regard (self-merging). It's mostly
> historical. Other places I've worked at or consulted with *require* code
> review to merge.
>
> So you know, I work like this (and I'm pretty reasonably productive, from
> most people's perspective).
>
>>
>> That is why I think that even if we are to use this puppet nrpe
>> management, there should still be a way to make manual adjustments,
>> not just because of these projects, but also to fix other icinga
>> issues. For example, right now it receives some nonsense (broken)
>> data from ldap about instances that don't even exist anymore. If it
>> weren't for that nasty workaround, an instance ignore list that
>> prevents these hosts from being monitored, icinga would be full of
>> hosts that are down. How would you apply an i_dont_exist puppet
>> class to a nonexistent node? :P
>>
>
> Did you put a bug in about the broken data?
>
>>
>> I have nothing against "labs cloning production", except that IMHO
>> it should be the other way around (production should actually clone
>> labs, which is the testing env where changes should happen first
>> before they get deployed on production). Still, labs != production,
>> so I think we could have some extra thing here, existing on labs
>> only and not on production, that would make it easier for regular,
>> non-ops people to manage icinga.
>>
>
> The biggest reason we can't do the same thing in labs and production for
> nagios is that in production nagios is generated via exported resources,
> which are disabled in labs.
>
> As far as I know, that and ssh host keys are the only things in Wikimedia's
> puppet that require exported resources.
>
> - Ryan
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>