On Mon, Feb 26, 2018 at 5:00 PM, Chase Pettet <cpettet(a)wikimedia.org> wrote:
A lot of things are in the works for which I'll either add an agenda item to
the weekly or hold a followup meeting, but we have reached the point where a
preface email ahead of any discussion is more efficient. /Please/ 'ack' this
with a response because there are things in here that affect everyone on the
team and are difficult to rewind.
== On OpenStack and Ubuntu/Debian ==
In Austin we had said that the long tailed, delayed, and (some would say)
tortuous march of Neutron should mean we stick on Liberty and Trusty for the
time being to avoid the historic moving target problem. In making the
annual plan and lining up the many changes that have to occur in the next 15
months it became clear that if we do all of this in series, instead of in
parallel, we will never make it. We have to shift more sand under our feet
than feels entirely comfortable. That means moving to Mitaka before/as-we
target Neutron in order to mix in Jessie with backports (which also has
Mitaka). The update to Mitaka has a few challenges -- primarily that the
designate project made significant changes. I think I would like to stand up
new hypervisors ASAP once the main deployment is running Mitaka so we can
have customer workloads testing for as long as possible. This in theory sets
us up for an N+1 upgrade path on Debian through Stretch and Pike.
ACK.
== On monitoring and alerting ==
[...]
ACK.
== Naming (the worst of all things) ==
==== cloud ====
[...]
ACK. 'cloud' prefix.
==== labtest ====
Lab[test]* needs to be changed as well. The 'test' designation here has
been confusing for everyone who is not Andrew and myself numerous times over
the last year(s). For clarity, the lab[test] environment is a long lived
staging and PoC grounds for openstack provider testing where we need actual
integration into hardware, or where functionality cannot be tested in an
openstack-on-openstack way. Testing VXLAN overlay for instance is in this
category. Migration strategy for upgrade paths of OpenStack itself,
especially where significant networking changes are made, would be in this
category. Hypervisor integration, where kernel versions need to be vetted
and package updates need to be canaried, is in this category. Lab[test]
will never have tenants or projects other than ourselves. This has not been
obvious and, as an environment, it has been thought to be transient,
temporary, and/or customer facing at various points.
My first instinct was to fold the [test] naming into whatever next phase
normal prepend we settle on (i.e. cloud). Bryan pointed out that making it
more difficult to discern between customer facing equipment and internal
equipment is a net-negative even if it did away with the confusion we are
living with now. I propose we add an indicator of [i] to all internal "cloud"
equipment; nothing with this indicator will ever be customer facing. The
current indicator of [test] is used both for hiera targeting via regex.yaml
and as a human indicator.
lab => cloud
cloudvirt1001
cloudcontrol1001
cloudservices1001
cloudnodepool1001
labtest => cloudi
cloudicontrol2003
cloudivirt2001
cloudivirt2002
Or open to suggestion, but we need to settle on something this week.
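
Since the indicator doubles as a machine-readable marker (the way [test] is used for hiera targeting today), the lab => cloud / labtest => cloudi split above can be checked mechanically. A minimal Python sketch, assuming the role list is just the four roles that appear in the example hostnames; the real regex.yaml patterns are not shown in this thread, and `parse_host` and `KNOWN_ROLES` are hypothetical names:

```python
import re
from typing import NamedTuple, Optional

# Assumption: only the roles seen in the example hostnames of this thread.
KNOWN_ROLES = ("virt", "control", "services", "nodepool")

# cloud + optional "i" (internal) + role + four-digit host number
HOST_RE = re.compile(
    r"^cloud(?P<internal>i?)(?P<role>"
    + "|".join(KNOWN_ROLES)
    + r")(?P<number>\d{4})$"
)


class CloudHost(NamedTuple):
    internal: bool  # True means never customer facing
    role: str
    number: int


def parse_host(name: str) -> Optional[CloudHost]:
    """Return the parsed hostname, or None if it does not follow the scheme."""
    match = HOST_RE.match(name)
    if match is None:
        return None
    return CloudHost(
        internal=bool(match.group("internal")),
        role=match.group("role"),
        number=int(match.group("number")),
    )


if __name__ == "__main__":
    for host in ("cloudvirt1001", "cloudicontrol2003", "labvirt1001"):
        print(host, parse_host(host))
```

Matching against an explicit role list rather than a wildcard is what keeps a hypothetical role whose name starts with "i" from being read as internal.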
Let's be even more clear:
cloudvirt1001-dev
cloudcontrol1001-dev
cloudservices1001-dev
cloudnodepool1001-dev
or
cloudvirt1001-devel
cloudcontrol1001-devel
cloudservices1001-devel
cloudnodepool1001-devel
or
cloudvirt1001-test
cloudcontrol1001-test
cloudservices1001-test
cloudnodepool1001-test
This means using a word suffix which is clear and meaningful to the eye.
If you don't like dashes '-', then without them:
cloudvirt1001devel
cloudcontrol1001devel
cloudservices1001devel
cloudnodepool1001devel
We could use the 'devel' keyword for new servers which are being
developed, before they get into production.
And then, we could use the 'test' keyword for staging environments.
Of course we can use just one, I don't mind; the main point of my
proposal is the visual word suffix.
==== deployments and regions (oh my) ====
I have struggled with this damn naming thing for so long I am numb to it :)
I have the following theory: there is no defensible naming strategy, only
ones that do not make you vomit.
===== Current situation =====
We have been working with the following assumptions: a "deployment" is a
superset of an openstack setup (keystone, nova, glance, etc) where each
"deployment" is a functional analog, i.e. even though striker is not an
openstack component it is a part of our openstack ...stack and as such is
assignable to a particular deployment. The hierarchy is deployment => region
=> component(s)[availability-zones]. We currently have 2 full deployments and
1 burgeoning one: main (customer facing in eqiad), labtest (internal use
cases in codfw), and labtestn (internal PoC neutron migration environment).
FYI in purely OpenStack ecosystem terms, the shareable
portions between regions are keystone and horizon.
role::wmcs::openstack::main::control
deployment
-> region
--> availability zone
main
-> eqiad
--> nova
So far this has been fine and was a needed classification system to make our
code multi-tenant at all. We are working with several drawbacks at the
moment: labtest is a terrible name (as described above); labtestn is
difficult to understand; if we pursue the labtest and labtestn strategy we
end up with mainn; regions and availability zones are not coupled to
deployment naming; and these names, while distinct, do not lend themselves
to cohesive expansion. On and on. Nothing will be perfect, but we can do a
lot better. I have had a lot of issues in finding a naming scheme that we
can live with here, such as:
* 'db' in the name issue
* 1001 looking like a host issue
* labtest is a prepend (labtestn is not)
* lack of clarity between internal/staging/PoC usage and customer facing
* schemes that provide hugely long and impractical names
===== proposed situation =====
I do not feel that enamored with any naming solution; all the ones I've
tried end up with oddities and particular ugliness.
[site][numeric] (deployment)
-> [site][numeric][r postfix for region] (region)
--> [site][numeric][region][letter postfix for row] (availability zone -- indicator for us that will last a long time I expect)
# eqiad0 is now 'main' and will be retired with neutron. It also will not
# match the consistent naming for region, etc.
# legacy to be removed
# role::wmcs::openstack::eqiad0::control
eqiad0
-> eqiad
--> nova
# Once the current nova-network setup is retired we end up at deployment 1
# in eqiad
eqiad1
-> eqiad1r
--> eqiad1rb
--> eqiad1rc
# role::wmcs::openstack::codfwi1::control
codfwi1
-> codfwi1r
--> codfwi1rb
codfwi2
-> codfwi2r
--> codfwi2rb
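
To sanity-check that the scheme composes, the examples above can be generated from their parts. This is only a sketch in Python: the placement of the "i" marker between site and number is inferred from the codfwi1 examples rather than from the written pattern, and `deployment_names` and `puppet_role` are hypothetical helpers, not existing code.

```python
from typing import Dict, List, Union

Names = Dict[str, Union[str, List[str]]]


def deployment_names(
    site: str, number: int, rows: List[str], internal: bool = False
) -> Names:
    """Derive deployment, region and availability zone names.

    Assumption: internal deployments carry an "i" between the site and
    the number, as in the codfwi1 example.
    """
    marker = "i" if internal else ""
    deployment = f"{site}{marker}{number}"
    region = f"{deployment}r"  # "r" postfix for region
    zones = [f"{region}{row}" for row in rows]  # row letter postfix
    return {
        "deployment": deployment,
        "region": region,
        "availability_zones": zones,
    }


def puppet_role(deployment: str, component: str) -> str:
    """Build a role name in the style shown in the examples above."""
    return f"role::wmcs::openstack::{deployment}::{component}"


if __name__ == "__main__":
    print(deployment_names("eqiad", 1, ["b", "c"]))
    print(deployment_names("codfw", 1, ["b"], internal=True))
    print(puppet_role("codfwi1", "control"))
```

Deriving every name from site, number, and row is what couples regions and availability zones to the deployment name, which the current main/labtest/labtestn names do not do.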
[...]
Likewise:
codfw2-test
- codfw2r-test
-- codfw2rb-test
or
codfw2devel
- codfw2rdevel
-- codfw2rbdevel
(same pattern of adding a meaningful suffix)