On 2/26/18 10:00 AM, Chase Pettet wrote:
A lot of things are in the works for which I'll either add an agenda
item to the weekly or will have a followup meeting but it has reached
the point of a preface email to any discussion being more efficient.
/Please/ 'ack' this with a response because there are things in here
that affect everyone on the team and are difficult to rewind.
== On OpenStack and Ubuntu/Debian ==
Yep! We've discussed this and I think it's the right approach. I'll be
working on the Mitaka move during/right after my labweb work. I'm
certainly in favor of staying on the .deb train as long as possible.
== On monitoring and alerting ==
I have made a change for myself that has the following effect:
regular critical alerts are on a standard 'awake' schedule and
wmcs-team alerts are still 24/7.
If this means I can stop getting paged for db-server outages in the
night, I want it!
Chico has expressed a desire to contribute while IRC is dormant and we
have begun a series of 1:1 conversations about our environment. He
has been working on logic to alert on a portion of puppet failures
<https://gerrit.wikimedia.org/r/c/411315> rather than every puppet
failure. This, to my mind, does not mean we have solved the puppet
flapping issue but it's also not doing us any good to be fatigued by
an issue we do not have time to investigate that has been seemingly
benign for a year. I am considering whether we should move this to
tools.checker, increase retries on our single puppet alerting logic,
and add alerting to the main icinga for it. Hopefully, we can talk
about this in our meeting.
This sounds good, although I don't want to train new people that those
puppet alerts are unsolvable and forever with us. I continue to hope
that there's an actual fixable problem under there somewhere.
== Naming (the worst of all things) ==
==== cloud ====
'cloudvirtXXXX' sounds fine to me. Shall we start back at 001 for hosts
in the new naming scheme, or continue to count up from the lab* numbering?
==== labtest ====
Any of the proposed options for this are fine with me; agreed that
getting 'test' out of there should reduce confusion.
==== deployments and regions (oh my) ====
I, absurdly, have more to write but this is enough for a single email.
Implications for Neutron actually happening, Debian, next wave of
reboots, team practices, and more will be separate. Please ack this
and provide feedback or I'm a runaway train.
I read all this and agree that deployment/region names will be a
terrible mouthful, but don't have better ideas. I'll leave this for you
and Arturo to hash out :)
Thanks for thinking all this through!
-A