A lot of things are in the works for which I'll either add an agenda item to the weekly or hold a followup meeting, but it has reached the point where a preface email makes any discussion more efficient. /Please/ 'ack' this with a response, because there are things in here that affect everyone on the team and are difficult to rewind.
== On OpenStack and Ubuntu/Debian ==
In Austin we had said that the long-tailed, delayed, and (some would say) tortuous march of Neutron should mean we stick with Liberty and Trusty for the time being to avoid the historic moving-target problem. In making the annual plan and lining up the many changes that have to occur in the next 15 months, it became clear that if we do all of this in series, instead of in parallel, we will never make it. We have to shift more sand under our feet than feels entirely comfortable. That means moving to Mitaka https://www.openstack.org/software/mitaka/ before/as we target Neutron, in order to mix in Jessie with backports (which also has Mitaka). The update to Mitaka has a few challenges -- primarily that the designate project made significant changes https://docs.openstack.org/designate/pike/admin/upgrades/mitaka.html. I would like to stand up new hypervisors ASAP once the main deployment is running Mitaka so we can have customer workloads testing for as long as possible. This in theory sets us up for an N+1 upgrade path on Debian through Stretch and Pike. https://phabricator.wikimedia.org/T169099#3959060
== On monitoring and alerting ==
Last Oct I made a task https://phabricator.wikimedia.org/T178405 to update some of our alerting logic, and in Austin we talked about how to improve our coverage and move towards a rotation-based workflow. The move to having a 'normal' on-call rotation, and especially one where we take better advantage of our time-zone spread, is going to require more sophisticated management than we have now, primarily: escalations and more complicated alerting and acknowledgement logic.
This came to the forefront again with the recent loss of labvirt1008. AFAICT the hypervisor rebooted in <=4m https://phabricator.wikimedia.org/T187292#3971877 and so did not alert. There is also the problem of it coming back up and not alerting on the "bad" state that has client instances shut down. We reviewed that behavior and are in agreement that instances starting by default on hypervisor startup has more downsides than up, but it should still be an alert-able errant state. I created a wmcs-team contact group https://gerrit.wikimedia.org/r/c/410525/ and added a check https://gerrit.wikimedia.org/r/c/413452/ that changes our new normal to be some instance running on every active hypervisor. Then I proceeded to add a bunch of checks https://gerrit.wikimedia.org/r/q/topic:%2522openstack%2522+(status:open%20OR%20status:merged), adjusting existing checks to alert wmcs-team, and changing some checks to 'critical' that were not.
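For the curious, the shape of that hypervisor check is roughly the sketch below. This is a minimal illustration only, not the actual code in the linked change; the credential handling, environment variables, and nagios-style exit codes are assumptions.

#!/usr/bin/env python
# Sketch of an icinga-style check: every enabled, 'up' hypervisor should be
# running at least one instance. Auth plumbing here is illustrative, not ours.
import os
import sys

from keystoneauth1 import loading, session
from novaclient import client as nova_client


def get_nova():
    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url=os.environ['OS_AUTH_URL'],
        username=os.environ['OS_USERNAME'],
        password=os.environ['OS_PASSWORD'],
        project_name=os.environ['OS_PROJECT_NAME'],
        user_domain_name='Default',
        project_domain_name='Default',
    )
    return nova_client.Client('2', session=session.Session(auth=auth))


def main():
    empty = []
    for hv in get_nova().hypervisors.list():
        # Only enabled, 'up' hypervisors count toward the new normal.
        if hv.status == 'enabled' and hv.state == 'up' and hv.running_vms == 0:
            empty.append(hv.hypervisor_hostname)
    if empty:
        print('CRITICAL: active hypervisors with no running instances: %s'
              % ', '.join(sorted(empty)))
        return 2
    print('OK: all active hypervisors are running instances')
    return 0


if __name__ == '__main__':
    sys.exit(main())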
The icinga setup in some ways makes single-tenant assumptions that we'll have to work through: for example, 'critical' alerts all opsen, and it is the only way to override the configuration to never re-alert. At the moment none of the checks that alert only wmcs-team, and not all of opsen, will re-alert. Some checks may double-alert where WMCS roots are in both groups. There is also a coverage issue: there are checks where it makes sense for those of us in this group to receive alerts 24/7, or at lower thresholds for warning, but it would cause fatigue to alert all of opsen. I have made a change for myself that has the following effect: regular critical alerts are on a standard 'awake' schedule and wmcs-team alerts are still 24/7. Andrew, Madhu, and I have been on a 24/7 alerting schedule for a long time now, and I think shifting to 24/7 for wmcs-team things is an interim step for all of us. This has the side effect of requiring that everything we want to be alerted to 24/7 alerts the wmcs-team contact group.
I am going to schedule a meeting to review what is currently alerting wmcs-team. This is both so that we can talk as a group about what should alert, and so that we can talk as a group about what does currently. I want everyone to walk away knowing what pages could be sent out and the basics of what they mean. I want everyone in the group to walk away feeling comfortable with our transitional strategy, and acknowledging as a group what things we need to know about 24/7. We can talk about how to take advantage of our time-zone spread in this arrangement, and briefly talk about what it would mean to move to something based on pagerduty/victorops.
The introduction of wmcs-team should also allow us to have our own IRC alerting, in combination with #wikimedia-operations, going to #wikimedia-cloud-feed (or wherever). One of the complaints it seems we have all had is that while treating IRC as persistent for alerting is problematic in general, it is even more problematic in a channel as noisy as #wikimedia-operations.
Chico has expressed a desire to contribute while IRC is dormant and we have begun a series of 1:1 conversations about our environment. He has been working on logic to alert on a portion of puppet failures https://gerrit.wikimedia.org/r/c/411315 rather than every puppet failure. This, to my mind, does not mean we have solved the puppet flapping issue, but it's also not doing us any good to be fatigued by an issue we do not have time to investigate and that has been seemingly benign for a year. I am considering whether we should move this to tools.checker, increase retries on our single puppet alerting logic, and add alerting to the main icinga for it. Hopefully we can talk about this in our meeting.
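To make the idea concrete, the logic amounts to something like the sketch below -- a toy stand-in, not Chico's actual patch; the thresholds and the input format are invented for illustration.

# Toy sketch of alerting on a portion of puppet failures rather than every one.
# Thresholds and the input format (hostname -> did the last run succeed?) are
# invented; the real logic lives in the gerrit change linked above.
import sys

WARN_PCT = 5.0   # assumed warning threshold (% of hosts failing)
CRIT_PCT = 15.0  # assumed critical threshold


def check_puppet_failures(statuses, warn_pct=WARN_PCT, crit_pct=CRIT_PCT):
    """statuses: dict of hostname -> bool (True == last puppet run succeeded)."""
    total = len(statuses)
    failed = sorted(h for h, ok in statuses.items() if not ok)
    pct = (100.0 * len(failed) / total) if total else 0.0
    msg = '%d/%d hosts failing puppet (%.1f%%)' % (len(failed), total, pct)
    if pct >= crit_pct:
        return 2, 'CRITICAL: ' + msg + ': ' + ', '.join(failed)
    if pct >= warn_pct:
        return 1, 'WARNING: ' + msg
    return 0, 'OK: ' + msg


if __name__ == '__main__':
    # Fake data: 1 of 40 hosts failing is 2.5%, below both thresholds, so no alert.
    fake = {'tools-worker-10%02d' % i: True for i in range(1, 41)}
    fake['tools-worker-1007'] = False
    code, message = check_puppet_failures(fake)
    print(message)
    sys.exit(code)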
== Naming (the worst of all things) ==
==== cloud ====
We have continued to phase out the word 'Lab', and even some networking equipment https://phabricator.wikimedia.org/T187933 has made the change. As part of the Debian and Neutron migrations we need to replace or re-image many of our servers, and it seems like the ideal time to adopt a 'cloud' variant naming replacement. In our weekly meeting I proposed 'cld' as a replacement for 'lab' outright. In discussions on ops-l it seems 'lab'=>'cloud' is most desired for simplicity and readability. 'cloud' as a prepend seems fine to me, and I don't anticipate objections within the team, so I'm considering it decided (most of us are on ops-l).
==== labtest ====
Lab[test]* needs to be changed as well. The 'test' designation here has been confusing for everyone who is not Andrew or myself numerous times over the last year(s). For clarity, the lab[test] environment is a long-lived staging and PoC ground for openstack provider testing where we need actual integration with hardware, or where functionality cannot be tested in an openstack-on-openstack way. Testing the VXLAN overlay, for instance, is in this category. Migration strategy for upgrade paths of Openstack itself, especially where significant networking changes are made, is in this category. Hypervisor integration where kernel versions need to be vetted, and package updates need to be canaried, is in this category. Lab[test] will never have tenants or projects other than ourselves. This has not been obvious and, as an environment, it has been thought to be transient, temporary, and/or customer facing at various points.
My first instinct was to fold the [test] naming into whatever next-phase normal prepend we settle on (i.e. cloud). Bryan pointed out that making it more difficult to discern between customer-facing equipment and internal equipment is a net negative even if it did away with the confusion we are living with now. I propose we add an indicator of [i] to all "cloud" equipment and *nothing with this indicator will ever be customer facing*. The current indicator of [test] is used both for hiera targeting via regex.yaml and as a human indicator.
lab => cloud
cloudvirt1001 cloudcontrol1001 cloudservices1001 cloudnodepool1001
labtest => cloudi
cloudicontrol2003 cloudivirt2001 cloudivirt2002
Or open to suggestion, but we need to settle on something this week.
==== deployments and regions (oh my) =====
I have struggled with this damn naming thing for so long I am numb to it :) I have the following theory: there is no defensible naming strategy, only ones that do not make you vomit.
===== Current situation =====
We have been working with the following assumptions: a "deployment" is a superset of an openstack setup (keystone, nova, glance, etc) where each "deployment" is a functional analog. i.e. even though striker is not an openstack component it is a part of our openstack ...stack and as such is assignable to a particular deployment. deployment => region => component(s)[availability-zones]. We currently have 2 full and 1 burgeoning deployment: main (customer facing in eqiad), labtest (internal use cases in codfw), and labtestn (internal PoC neutron migration environment). FYI, in purely OpenStack ecosystem terms, the shareable portions between regions are keystone and horizon.
role::wmcs::openstack::main::control
deployment -> region --> availability zone
main -> eqiad --> nova
So far this has been fine and was a needed classification system to make our code multi-tenant at all. We are working with several drawbacks at the moment: labtest is a terrible name (as described above); labtestn is difficult to understand; if we pursue the labtest and labtestn strategy we end up with mainn; regions and availability zones are not coupled to deployment naming; and these names, while distinct, do not lend themselves to cohesive expansion. On and on, and nothing will be perfect, but we can do a lot better. I have had a lot of issues finding a naming scheme that we can live with here, such as:
* 'db' in the name issue
* 1001 looking like a host issue
* labtest is a prepend (labtestn is not)
* unclarity on internal/staging/PoC usage and customer facing
* schemes that provide hugely long and impractical names
===== proposed situation =====
I am not enamored with any naming solution; all the ones I've tried end up with oddities and particular ugliness.
[site][numeric](deployment) -> [site][numeric][r postfix for region] (region) --> [site][numeric][region][letter postfix for row] (availability zone -- indicator for us that will last a long time I expect)
# eqiad0 is now 'main' and will be retired with neutron. It also will not match the consistent naming for region, etc.
# legacy to be removed
# role::wmcs::openstack::eqiad0::control
eqiad0 -> eqiad --> nova
# Once the current nova-network setup is retired we end up at deployment 1 in eqiad
eqiad1 -> eqiad1r --> eqiad1rb --> eqiad1rc
# role::wmcs::openstack::codfwi1::control
codfwi1 -> codfwi1r --> codfwi1rb
codfwi2 -> codfwi2r --> codfwi2rb
This takes our normal datacenter naming ([dc provider][airport]) and adds an 'i' for internal use cases, a numeric postfix for the deployment number per site, and postfixes for sub-names such as "region" or "availability-zone". It's not phonetic but it could work. I am going to drop a few links I've walked through in the bottom section (#naming). My only ask is that if you have a concern, please suggest an alternative that is thought out to at least 3 deployments per site and differentiates "internal" and "external" use cases. I can change our existing deployments without too much fanfare. These are basically key namespaces in hiera, and class namespaces in Puppet, at the moment. I won't bother updating the regions or availability zones that exist now in place -- not until redeployment. It becomes decidedly more fixed as we move into more eqiad deployments (as I have no plans to change the existing eqiad deployment in place). This is influenced by my experience naming things in the networking world, where there are multiple objects tied together to achieve a desired end, such as: foo-in-rule-set, foo-interface, foo-out-rule-set, foo-provider-1, etc.
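To make the shape of these names concrete, here is a throwaway parser for the proposed scheme. It is purely illustrative -- the regex is mine and exists nowhere in puppet -- but it shows how a name like 'codfwi1rb' decomposes.

# Throwaway illustration of the proposed [site][i?][numeric][r][row] scheme.
import re

NAME_RE = re.compile(
    r'^(?P<site>[a-z]+?)'              # datacenter, e.g. eqiad / codfw
    r'(?P<internal>i?)'                # optional 'i' for internal-only deployments
    r'(?P<num>\d+)'                    # deployment number within that site
    r'(?P<region>r(?P<row>[a-z])?)?$'  # optional region marker, optional row letter (AZ)
)


def describe(name):
    m = NAME_RE.match(name)
    if not m:
        return '%s: does not match the scheme' % name
    if m.group('row'):
        scope = 'availability zone'
    elif m.group('region'):
        scope = 'region'
    else:
        scope = 'deployment'
    return '%s: site=%s internal=%s deployment=%s scope=%s' % (
        name, m.group('site'), bool(m.group('internal')), m.group('num'), scope)


for example in ('eqiad1', 'eqiad1r', 'eqiad1rb', 'codfwi1', 'codfwi1r', 'codfwi1rb'):
    print(describe(example))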
I, absurdly, have more to write but this is enough for a single email. Implications for Neutron actually happening, Debian, next wave of reboots, team practices, and more will be separate. Please ack this and provide feedback or I'm a runaway train.
Best,
On Mon, Feb 26, 2018 at 5:00 PM, Chase Pettet cpettet@wikimedia.org wrote:
A lot of things are in the works for which I'll either add an agenda item to the weekly or hold a followup meeting, but it has reached the point where a preface email makes any discussion more efficient. /Please/ 'ack' this with a response, because there are things in here that affect everyone on the team and are difficult to rewind.
== On OpenStack and Ubuntu/Debian ==
In Austin we had said that the long-tailed, delayed, and (some would say) tortuous march of Neutron should mean we stick with Liberty and Trusty for the time being to avoid the historic moving-target problem. In making the annual plan and lining up the many changes that have to occur in the next 15 months, it became clear that if we do all of this in series, instead of in parallel, we will never make it. We have to shift more sand under our feet than feels entirely comfortable. That means moving to Mitaka before/as we target Neutron, in order to mix in Jessie with backports (which also has Mitaka). The update to Mitaka has a few challenges -- primarily that the designate project made significant changes. I would like to stand up new hypervisors ASAP once the main deployment is running Mitaka so we can have customer workloads testing for as long as possible. This in theory sets us up for an N+1 upgrade path on Debian through Stretch and Pike.
ACK.
== On monitoring and alerting ==
[...]
ACK.
== Naming (the worst of all things) ==
==== cloud ====
[...]
ACK. 'cloud' prefix.
==== labtest ====
Lab[test]* needs to be changed as well. The 'test' designation here has been confusing for everyone who is not Andrew or myself numerous times over the last year(s). For clarity, the lab[test] environment is a long-lived staging and PoC ground for openstack provider testing where we need actual integration with hardware, or where functionality cannot be tested in an openstack-on-openstack way. Testing the VXLAN overlay, for instance, is in this category. Migration strategy for upgrade paths of Openstack itself, especially where significant networking changes are made, is in this category. Hypervisor integration where kernel versions need to be vetted, and package updates need to be canaried, is in this category. Lab[test] will never have tenants or projects other than ourselves. This has not been obvious and, as an environment, it has been thought to be transient, temporary, and/or customer facing at various points.
My first instinct was to fold the [test] naming into whatever next-phase normal prepend we settle on (i.e. cloud). Bryan pointed out that making it more difficult to discern between customer-facing equipment and internal equipment is a net negative even if it did away with the confusion we are living with now. I propose we add an indicator of [i] to all "cloud" equipment and nothing with this indicator will ever be customer facing. The current indicator of [test] is used both for hiera targeting via regex.yaml and as a human indicator.
lab => cloud
cloudvirt1001 cloudcontrol1001 cloudservices1001 cloudnodepool1001
labtest => cloudi
cloudicontrol2003 cloudivirt2001 cloudivirt2002
Or open to suggestion, but we need to settle on something this week.
Let's be even more clear:
cloudvirt1001-dev cloudcontrol1001-dev cloudservices1001-dev cloudnodepool1001-dev
or
cloudvirt1001-devel cloudcontrol1001-devel cloudservices1001-devel cloudnodepool1001-devel
or
cloudvirt1001-test cloudcontrol1001-test cloudservices1001-test cloudnodepool1001-test
This means using a word suffix which is clear and meaningful to the eye. If you don't like dashes ('-'), then without them:
cloudvirt1001devel cloudcontrol1001devel cloudservices1001devel cloudnodepool1001devel
We could use the 'devel' keyword for new servers which are being developed, before they get into production. And then, we could use the 'test' keyword for staging environments. Of course we can use just one, I don't mind; the main point of my proposal is the visual word suffix.
==== deployments and regions (oh my) =====
I have struggled with this damn naming thing for so long I am numb to it :) I have the following theory: there is no defensible naming strategy, only ones that do not make you vomit.
===== Current situation =====
We have been working with the following assumptions: a "deployment" is a superset of an openstack setup (keystone, nova, glance, etc) where each "deployment" is a functional analog. i.e. even though striker is not an openstack component it is a part of our openstack ...stack and as such is assignable to a particular deployment. deployment => region => component(s)[availability-zones]. We currently have 2 full and 1 burgeoning deployment: main (customer facing in eqiad), labtest (internal use cases in codfw), and labtestn (internal PoC neutron migration environment). FYI, in purely OpenStack ecosystem terms, the shareable portions between regions are keystone and horizon.
role::wmcs::openstack::main::control
deployment -> region --> availability zone
main -> eqiad --> nova
So far this has been fine and was a needed classification system to make our code multi-tenant at all. We are working with several drawbacks at the moment: labtest is a terrible name (as described above); labtestn is difficult to understand; if we pursue the labtest and labtestn strategy we end up with mainn; regions and availability zones are not coupled to deployment naming; and these names, while distinct, do not lend themselves to cohesive expansion. On and on, and nothing will be perfect, but we can do a lot better. I have had a lot of issues finding a naming scheme that we can live with here, such as:
- 'db' in the name issue
- 1001 looking like a host issue
- labtest is a prepend (labtestn is not)
- unclarity on internal/staging/PoC usage and customer facing
- schemes that provide hugely long and impractical names
===== proposed situation =====
I am not enamored with any naming solution; all the ones I've tried end up with oddities and particular ugliness.
[site][numeric](deployment) -> [site][numeric][r postfix for region] (region) --> [site][numeric][region][letter postfix for row] (availability zone -- indicator for us that will last a long time I expect)
# eqiad0 is now 'main' and will be retired with neutron. It also will not match the consistent naming for region, etc.
# legacy to be removed
# role::wmcs::openstack::eqiad0::control
eqiad0 -> eqiad --> nova
# Once the current nova-network setup is retired we end up at deployment 1 in eqiad
eqiad1 -> eqiad1r --> eqiad1rb --> eqiad1rc
# role::wmcs::openstack::codfwi1::control
codfwi1 -> codfwi1r --> codfwi1rb
codfwi2 -> codfwi2r --> codfwi2rb [...]
Likewise:
codfw2-test - codfw2r-test -- codfw2rb-test
or
codfw2devel - codfw2rdevel -- codfw2rbdevel
(same pattern of adding a meaningful suffix)
lab => cloud
cloudvirt1001 cloudcontrol1001 cloudservices1001 cloudnodepool1001
labtest => cloudi
cloudicontrol2003 cloudivirt2001 cloudivirt2002
Or open to suggestion, but we need to settle on something this week.
Let's be even more clear:
cloudvirt1001-dev cloudcontrol1001-dev cloudservices1001-dev cloudnodepool1001-dev
or
cloudvirt1001-devel cloudcontrol1001-devel cloudservices1001-devel cloudnodepool1001-devel
or
cloudvirt1001-test cloudcontrol1001-test cloudservices1001-test cloudnodepool1001-test
This means using a word suffix which is clear and meaningful to the eye. If you don't like dashes ('-'), then without them:
cloudvirt1001devel cloudcontrol1001devel cloudservices1001devel cloudnodepool1001devel
We could use the 'devel' keyword for new servers which are being developed, before they get into production. And then, we could use the 'test' keyword for staging environments. Of course we can use just one, I don't mind; the main point of my proposal is the visual word suffix.
Thank you Arturo for reading and thinking about it :)
A few thoughts:
* Is dev or devel better than 'test'? I'm not sure. restbase-dev* variants do exist but I recall the gnashing of teeth at that situation. I don't have a strong opinion; I think dev is a little better than test.
* On server build pipeline:
We could use the 'devel' keyword for new servers which are being developed, before they get into production.
Small note that this isn't the workflow here (if I'm understanding correctly), servers go in with whatever name they will live with. We don't historically move servers along a pipeline in this way and server renaming is a pain in that it needs to follow down the path of puppet, rackspace, and all the way to the physical tag on the server via dcops.
* cloudvirt1001-dev I think is not a strong candidate as there is a strong precedent (probably a necessity) for not having anything after the numerical range assigned by site at https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Server.... I think anything like foo1001-dev will be enough of a special snowflake to break any reasonable regex in existence (see the small regex illustration after the counter proposal below). The only similar example I can find is /^restbase-dev100[4-6].eqiad.wmnet$/ (i.e. foo-dev1001).
* I hate dashes in places where they will translate to Puppet or another medium that cannot handle the character. I have gone down this road at a previous place and it turned into a nightmare of constantly replacing -'s with _'s and interspersing them until your brain couldn't tell which was which. That's entirely my historical perspective, though, and I won't be persnickety if everyone feels differently. I'm thinking about the bleedover for names/roles/deployments/regions/availability zones at the moment, and I acknowledge that for readability something could work out.
* A small thought on meaningfulness of prefixes, symbolism, and suffixes :) There is a battle between first-5-minutes obviousness and long-term practical mental modeling IMO. Nothing at https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Datace... or https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Networ... passes the 'first 5 minutes most meaningful name' test, but it still works well. It's after the first 5 minutes where we are going to be living the majority of the time, so I'm OK with having to digest a keyword or indicator. Attempting to have verbose and obvious layman's indicators, vs symbolism or inferred meaning, isn't bad but it also isn't going to be practical. Readability and consistency are my hopeful standards. That being said, cloudvirt2001 vs cloudivirt2001 vs cloudvirti2001 are all pretty crap for readability....so...counter proposal (assuming we like 'dev')...
lab => cloud*
cloudvirt1001 cloudcontrol1001 cloudservices1001
labtest => cloud*-dev*
cloudcontrol-dev2003 cloudvirt-dev2001 cloudvirt-dev2002
# Once the current nova-network setup is retired we end up at deployment 1 in eqiad
eqiad1 -> eqiad1r --> eqiad1rb --> eqiad1rc
# role::wmcs::openstack::codfw1dev::control
codfw1dev -> codfw1rdev --> codfw1rbdev
# role::wmcs::openstack::codfw2dev::control
codfw2dev -> codfw2rdev --> codfw2rbdev
# Future customer facing deployment
# role::wmcs::openstack::codfw3::control
codfw3 -> codfw3r --> codfw3rb
A part of my heart hurts that things with 'dev' and without it are named in the same numeric series, but it's way more confusing to have codfw2 and codfw2-dev than to just make codfw2 and codfw3-dev IMO. It's true that server names do not make that many appearances in Puppet themselves. I don't love it, but if we limit the use of dashes to there and the rest of the use cases are still readable, that seems survivable to me. We end up with codfw2rbdev though :)
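As a footnote to the regex point above, here is a made-up example of the kind of hostname pattern that a foo1001-dev name would break, while the restbase-dev style (suffix before the number) survives. The pattern is not any production regex; it only mimics the common assumption that nothing follows the site-assigned numeric range.

# Made-up illustration only; not a regex we actually use anywhere.
import re

HOSTNAME_RE = re.compile(r'^(?P<name>[a-z-]+)(?P<num>\d{4})\.(?P<site>eqiad|codfw)\.wmnet$')

for host in ('cloudvirt1001.eqiad.wmnet',       # matches: name, number, site
             'cloudvirt-dev2001.codfw.wmnet',   # matches: restbase-dev style, suffix before the number
             'cloudvirt2001-dev.codfw.wmnet'):  # no match: trailing '-dev' after the number
    m = HOSTNAME_RE.match(host)
    print('%-32s %s' % (host, m.groupdict() if m else 'NO MATCH'))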
On Mon, Feb 26, 2018 at 9:00 AM, Chase Pettet cpettet@wikimedia.org wrote:
== On OpenStack and Ubuntu/Debian ==
Ack. If we can get to Pike without having to solve the more advanced OpenStack sourcing issue we are buying some time for Kolla or some other upstream solution that we can adopt rather than rolling our own deploy-from-source setup.
== On monitoring and alerting ==
Ack. Anything all y'all roots can work out for alerting is a-ok with me. Honestly my involvement at this point is figuring out how to sign the checks if we need to move to something like PagerDuty.
== Naming (the worst of all things) ==
==== cloud ====
Ack. "cloud" works for me.
==== labtest ====
Or open to suggestion, but we need to settle on something this week.
Half ack.
My main input here is that "i" for "internal" makes some logical sense, but visually "cloudfoo" vs "cloudifoo" is hard to spot the difference (at least in the tiny fonts I use for things). "cloudxfoo" or "cloudwfoo" has more visual differentiation. What do "x" or "w" stand for? meh. we can make up a backronym after picking a visually 'wide' character like x, w, or m.
==== deployments and regions (oh my) =====
===== proposed situation =====
Under the flat domain constraints that the larger SRE group seems to favor, I feel that all hierarchies are going to look and feel weird. In a more perfect world we would use subdomains to name clusters within datacenters and get the benefit of *actual* hierarchy. Barring that, faux namespacing can be workable, but omitting delimiters makes visual parsing painful. I get the s/-/_/ horror, but since puppet bans '-' and dns bans '_' it seems inevitable.
In the big picture though, I'm fine with pretty much anything that all y'all can agree on. We have 5 SREs on the team, so ties should be difficult. ;)
I, absurdly, have more to write but this is enough for a single email. Implications for Neutron actually happening, Debian, next wave of reboots, team practices, and more will be separate. Please ack this and provide feedback or I'm a runaway train.
Choo choo! Making decisions is hard, but it has to be done. I'm glad that you are doing it, and also taking the time to ask for feedback.
Bryan
Now that I'm aware we have a policy of not changing names and not reusing them, I believe it is good that we invest the required time in discussing the schemes and making the proper choices / changes.
== On dashes in puppet ==
I agree that s/-/_/ is something to avoid, totally. But if p4 supports dashes, we could take advantage of that, unless someone expects our puppet code to run in p<4. After a quick search on Google, I think the limitation was a thing in old puppet versions. No idea if this is actually true.
Could anyone confirm if this limitation applies in current times?
== On naming ==
A name like "codfw2rbdev" or "codfwi2rb" is simply non-pronounceable for me. I could end summoning Sauron ... :-) The language is that of Mordor, which I will not utter here..[0] Going back to your scheme:
[site][numeric](deployment) -> [site][numeric][r postfix for region] (region) --> [site][numeric][region][letter postfix for row] (availability zone -- indicator for us that will last a long time I expect)
Perhaps we could reduce the amount of keywords. For example, site and region could be redundant, since a site implies a region.
codfw <-> na-center
eqiad <-> na-east
ulsfo <-> na-west
esams <-> eu-center
This is your meaning of region, right?
Also, it seems that all components are "variables" and there is no common keyword for referring to them. What about using 'cloudvps'? (or even, 'cloud')
Using all of the above, then the idea is:
cloudvps-eqiad1 <-- prod deployment in eqiad number 1
cloudvps-eqiad1b <-- prod deployment in eqiad number 1 row b
cloudvps-eqiad1c <-- prod deployment in eqiad number 1 row c
(region is implied na-east)
cloudvps-dev-codfw2 <-- devel deployment in codfw number 2
cloudvps-dev-codfw2b <-- devel deployment in codfw number 2 row b
cloudvps-dev-codfw2c <-- devel deployment in codfw number 2 row c
(region is implied na-center)
Again, I will push for using a complete keyword 'dev' or something similar, rather than a letter.
I agree with Bryan, what we are looking for is domains and subdomains. Using domains and subdomains could solve most of our problems with naming. Are we reinventing the wheel with all these naming schemes? Perhaps we could simply invest our effort in addressing blockers for using subdomains.
[0] https://en.wikiquote.org/wiki/The_Lord_of_the_Rings:_The_Fellowship_of_the_R...
For sure, dashes are illegal characters in Puppet names, up to and including Puppet 5. See reserved words and acceptable names https://puppet.com/docs/puppet/5.0/lang_reserved.html and the style guide https://puppet.com/docs/puppet/5.3/style_guide.html.
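To illustrate the bleedover I mentioned earlier -- a contrived sketch, not code we have anywhere -- the same name ends up spelled two ways depending on whether it is living in DNS or in Puppet:

# Contrived illustration of the '-' vs '_' bleedover: hostnames and DNS want
# dashes, Puppet class/variable names cannot contain them, so every name with
# a dash grows a second, underscore spelling.
def puppet_safe(name):
    # e.g. a hypothetical class namespace: role::wmcs::openstack::codfw1_dev::control
    return name.replace('-', '_')


def dns_safe(name):
    # e.g. hostnames: underscores are not valid in DNS labels
    return name.replace('_', '-')


deployment = 'codfw1-dev'  # hypothetical deployment name containing a dash
print(puppet_safe(deployment))            # codfw1_dev
print(dns_safe(puppet_safe(deployment)))  # codfw1-dev
# The round trip works, but day to day you are forever guessing which spelling
# a given string is in before you grep for it.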
A name like "codfw2rbdev" or "codfwi2rb" is simply non-pronounceable for me. I could end summoning Sauron ... :-) The language is that of
What tends to happen is that things that cannot be phonetic become elongated.
i.e. "'dallas deployment 2'..is that a dev deployment? oh yeah it is."
But within a day or two you'll never have to look to verify if it's a dev deployment, because we are unlikely to have more than 5 deployments for the next few years. I very respectfully disagree that we'll be able to find a phonetic name that conveys the information here, especially with deeper context to come; it's dense, but I think it will be OK.
Perhaps we could reduce the amount of keywords. For example, site and region could be redundant, since a site implies a region.
codfw <-> na-center
eqiad <-> na-east
ulsfo <-> na-west
esams <-> eu-center
Site and region are not analogous, so it doesn't work to fold the identifiers into one. We will certainly have 2 regions within codfw within the next 15m and possibly 2 within eqiad shortly thereafter. Even for AWS this doesn't work the way it seems on the surface: while you choose a region+availability-zone, it isn't really that per se; they mask what the underlying deployment is, so us-west-1a for 2 companies can be different under the covers https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html. We won't have that abstraction to my knowledge. AWS does some interesting things to keep customer interfacing consistent even when it's not.
Also, it seems that all components are "variables" and there are no common keyword for referring to them. What about using 'cloudvps'? (or even, 'cloud')
Using all of the above, then the idea is:
cloudvps-eqiad1 <-- prod deployment in eqiad number 1
cloudvps-eqiad1b <-- prod deployment in eqiad number 1 row b
cloudvps-eqiad1c <-- prod deployment in eqiad number 1 row c
(region is implied na-east)
cloudvps-dev-codfw2 <-- devel deployment in codfw number 2
cloudvps-dev-codfw2b <-- devel deployment in codfw number 2 row b
cloudvps-dev-codfw2c <-- devel deployment in codfw number 2 row c
(region is implied na-center)
I don't agree that a prepend designating cloudvps is needed here. To my mind this would be like a prepend for WMF servers indicating they are for WMF; since there isn't another option, it is intrinsic.
Something like: '>cloudvps-eqiad1b <-- prod deployment in eqiad number 1 row b'
Wouldn't make sense, as it's possible to have regions overlap rows in the future, site and deployment are not 1:1, and prefacing with cloudvps raises the question: when is it not cloudvps, such that the modifier is useful?
I agree with Bryan, what we are looking for is domains and subdomains. Using domains and subdomains could solve most of our problems with naming. Are we reinventing the wheel with all these naming schemes?
I don't think subdomains would solve all of our problems here, but in the case of a cloud vs cloudtest, yeah it may be a nicer model for sure. I don't think it's worth pursuing now as it would be serious effort for a little return and it doesn't move forward any of the things that we are stuck on.
I have been thinking about where we could gain some readability without suffering a long series of painful context switches. I think potentially:
lab => cloud*
cloudvirt1001 cloudcontrol1001 cloudservices1001
labtest => cloud*-dev*
cloudcontrol-dev2003 cloudvirt-dev2001 cloudvirt-dev2002
# Once the current nova-network setup is retired we end up at deployment 1 in eqiad
eqiad1 -> eqiad1-r --> eqiad1-rb --> eqiad1-rc
"eq one" "eqiad one" "eqiad one AV b" "eqiad one AV c"
# role::wmcs::openstack::codfw1dev::control
codfw1dev -> codfw1-rdev --> codfw1-rbdev
# role::wmcs::openstack::codfw2dev::control
codfw2dev -> codfw2-rdev --> codfw2-rbdev
"dallas deployment 2" "dallas 2 AV b"
# Future customer facing deployment
# role::wmcs::openstack::codfw3::control
codfw3 -> codfw3-r --> codfw3-rb
On 2/26/18 10:00 AM, Chase Pettet wrote:
A lot of things are in the works for which I'll either add an agenda item to the weekly or hold a followup meeting, but it has reached the point where a preface email makes any discussion more efficient. /Please/ 'ack' this with a response, because there are things in here that affect everyone on the team and are difficult to rewind.
== On OpenStack and Ubuntu/Debian ==
Yep! We've discussed this and I think it's the right approach. I'll be working on the Mitaka move during/right after my labweb work. I'm certainly in favor of staying on the .deb train as long as possible.
== On monitoring and alerting ==
I have made a change for myself that has the following effect: regular critical alerts are on a standard 'awake' schedule and wmcs-team alerts are still 24/7.
If this means I can stop getting paged for db-server outages in the night, I want it!
Chico has expressed a desire to contribute while IRC is dormant and we have begun a series of 1:1 conversations about our environment. He has been working on logic to alert on a portion of puppet failures https://gerrit.wikimedia.org/r/c/411315 rather than every puppet failure. This, to my mind, does not mean we have solved the puppet flapping issue, but it's also not doing us any good to be fatigued by an issue we do not have time to investigate and that has been seemingly benign for a year. I am considering whether we should move this to tools.checker, increase retries on our single puppet alerting logic, and add alerting to the main icinga for it. Hopefully we can talk about this in our meeting.
This sounds good, although I don't want to train new people that those puppet alerts are unsolvable and forever with us. I continue to hope that there's an actual fixable problem under there somewhere.
== Naming (the worst of all things) ==
==== cloud ====
'cloudvirtXXXX' sounds fine to me. Shall we start back at 001 for hosts in the new naming scheme, or continue to count up from the lab* numbering?
==== labtest ====
Any of the proposed options for this are fine with me; agreed that getting 'test' out of there should reduce confusion.
==== deployments and regions (oh my) =====
I, absurdly, have more to write but this is enough for a single email. Implications for Neutron actually happening, Debian, next wave of reboots, team practices, and more will be separate. Please ack this and provide feedback or I'm a runaway train.
I read all this and agree that deployment/region names will be a terrible mouthful, but don't have better ideas. I'll leave this for you and Arturo to hash out :)
Thanks for thinking all this through!
-A