On 2/26/18 10:00 AM, Chase Pettet wrote:
A lot of things are in the works, and for each I'll either add an agenda item to the weekly or hold a followup meeting, but it has reached the point where a preface email ahead of any discussion is more efficient. /Please/ 'ack' this with a response because there are things in here that affect everyone on the team and are difficult to rewind.
== On OpenStack and Ubuntu/Debian ==
Yep! We've discussed this and I think it's the right approach. I'll be working on the Mitaka move during/right after my labweb work. I'm certainly in favor of staying on the .deb train as long as possible.
== On monitoring and alerting ==
I have made a change for myself that has the following effect: regular critical alerts are on a standard 'awake' schedule and wmcs-team alerts are still 24/7.
If this means I can stop getting paged for db-server outages in the night, I want it!
Chico has expressed a desire to contribute while IRC is dormant and we have begun a series of 1:1 conversations about our environment. He has been working on logic to alert on a portion of puppet failures https://gerrit.wikimedia.org/r/c/411315 rather than every puppet failure. This, to my mind, does not mean we have solved the puppet flapping issue but it's also not doing us any good to be fatigued by an issue we do not have time to investigate that has been seemingly benign for a year. I am considering whether we should move this to tools.checker, increase retries on our single puppet alerting logic, and add alerting to the main icinga for it. Hopefully, we can talk about this in our meeting.
This sounds good, although I don't want to train new people that those puppet alerts are unsolvable and forever with us. I continue to hope that there's an actual fixable problem under there somewhere.
== Naming (the worst of all things) ==
==== cloud ====
'cloudvirtXXXX' sounds fine to me. Shall we start back at 001 for hosts in the new naming scheme, or continue to count up from the lab* numbering?
==== labtest ====
Any of the proposed options for this are fine with me; agreed that getting 'test' out of there should reduce confusion.
==== deployments and regions (oh my) ====
I, absurdly, have more to write but this is enough for a single email. Implications for Neutron actually happening, Debian, next wave of reboots, team practices, and more will be separate. Please ack this and provide feedback or I'm a runaway train.
I read all this and agree that deployment/region names will be a terrible mouthful, but don't have better ideas. I'll leave this for you and Arturo to hash out :)
Thanks for thinking all this through!
-A