There is probably an entire hour's worth of talk about hiera, and I talked
a bit with Brooke about similar things last week. Let's do that maybe Wed
or Thu (talk about hiera end-to-end). Briefly though:
* I don't mind swapping a key in hiera for failover. A service URL would
probably be more sane for actual clients, but we can do that after all this
regardless. Doing carbon relay duplication seems fine to me too
(https://phabricator.wikimedia.org/T190512#4090428) but is maybe
overcomplicated at the moment. Whatever you think :)
* Let's not use these key paths:
wmcs::monitoring::server labmon1001.eqiad.wmnet
wmcs::monitoring::server_standby labmon1002.eqiad.wmnet
Let's instead, for main/eqiad0 and everything that is actually affected
here, use this key path as authoritative for now for the few production
things that read this value:
common/profile/openstack/main.yaml:profile::openstack::main::statsd_host:
'labmon1001.eqiad.wmnet'
And these key values for cloud tenants (this is where it's really
meaningful):
labs.yaml:statsd: labmon1001.eqiad.wmnet:8125
labs.yaml:statsite::instance::graphite_host: 'labmon1001.eqiad.wmnet'
Unfortunately I don't know why these were duplicated originally, so I'm not
sure how deep the cleanup would go, but for now let's just keep both.
Breaking down existing instances of the value labmon1001.eqiad.wmnet in
hiera:
common/profile/openstack/main.yaml:profile::openstack::main::statsd_host:
'labmon1001.eqiad.wmnet'
common/profile/openstack/base.yaml:profile::openstack::base::monitoring_host:
'labmon1001.eqiad.wmnet'
common/profile/openstack/labtest.yaml:profile::openstack::labtest::statsd_host:
'labmon1001.eqiad.wmnet'
Yes, this is a per-deployment value atm even though we don't really have a
per-deployment graphite instance. I'm not too worried about this
duplication, as profile::openstack::main::statsd_host should be the only
key in use where it isn't just filling a dummy role. We would fold these
into base, but that's probably not in scope atm.
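A rough sketch of what folding these into base could look like, not
something to merge as-is. It assumes a new key,
profile::openstack::base::statsd_host, which does not exist today
(hypothetical name), and it assumes the hiera() interpolation function
resolves across these files for the hosts in question, which would need
checking with a test compile:

```yaml
# hieradata/common/profile/openstack/base.yaml
# Hypothetical new key: a single place to hold the graphite/statsd host.
profile::openstack::base::statsd_host: 'labmon1001.eqiad.wmnet'

# hieradata/common/profile/openstack/main.yaml
# Existing key, now pointing at the base value instead of repeating the FQDN.
profile::openstack::main::statsd_host: "%{hiera('profile::openstack::base::statsd_host')}"

# hieradata/common/profile/openstack/labtest.yaml
# Same idea for the labtest deployment, which only fills a dummy role today.
profile::openstack::labtest::statsd_host: "%{hiera('profile::openstack::base::statsd_host')}"
```

With a layout like that, a failover would touch one line in base.yaml
instead of one line per deployment.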
labs/deployment-prep/common.yaml:service::configuration::statsd_host:
labmon1001.eqiad.wmnet
labs/deployment-prep/common.yaml:graphite_host:
labmon1001.eqiad.wmnet
labs/deployment-prep/common.yaml:statsd: labmon1001.eqiad.wmnet:8125
labs/deployment-prep/common.yaml:role::logstash::collector::statsd_host:
labmon1001.eqiad.wmnet
Deployment-prep specific values. Let's not worry about these for now. A
long list of hiera and puppet cleanup is necessary, and I don't know why
deployment-prep was ever set specifically, so let's let them worry about it
for now.
labs.yaml:statsd: labmon1001.eqiad.wmnet:8125
labs.yaml:statsite::instance::graphite_host: 'labmon1001.eqiad.wmnet'
Actual values pulled down by cloud instances, and I'm fine w/ these being
dupes for the moment. But no need to add another value either :)
role/common/labs/puppetmaster.yaml:labspuppetbackend::statsd_host:
"labmon1001.eqiad.wmnet"
This should really be pulling from the deployment-specific value, but that
probably requires some refactoring.
role/common/cache/misc.yaml: eqiad:
'labmon1001.eqiad.wmnet'
labmon1001 sits behind varnish iirc and this is set up for that. I don't
think this area of things is hiera-ized really, so let's just leave this
alone for now. Suffice it to say there are two sides to the failover from
labmon1001 to labmon1002: the population side (where a changed hiera
key will be the deal here for now) and the consumer side (where varnish
knows where to send
https://graphite-labs.wikimedia.org/). That failover is less
time-critical and would be changes here iiuc.
role/common/labsencapi.yaml:profile::puppetmaster::labsencapi::statsd_host:
"labmon1001.eqiad.wmnet"
This is ideally a deployment specific value and feeds off of
profile::openstack::main::statsd_host:
'labmon1001.eqiad.wmnet' for the most part for now.
---
And at the end, this doesn't seem to work because:
* my new hiera keys are not found (why hieradata/labs.yaml is never read?)
hieradata/labs.yaml is never read from any production host. There are two
hiera trees in use, and they also use different logic, so while there
appears to be key path lookup overlap between production hosts and cloud
instances, there really is not (other than some not-good scenarios that we
won't discuss here). I.e. you cannot have one value to rule them all,
because cloud instances and production hosts do not read the same config,
and even when they do, they do not read it in the same way :) So plan on 2
values to rule them all: one per deployment and one in labs.yaml for
instances (ok, that's 3 at least, unf).
* some other weirdness unknown to me
So much :)
* isn't there a way to introduce a global hiera key for all our environment?
No, we don't want to do this in theory, depending on what you mean by 'our
environment'. Count on cloud instances seeing hiera differently from
production hosts, and that being OK. If we wanted a value that was widely
used for our deployment we would use common/ and base, and let's still put
it under a profile:: path.
--
Skipping over a lot of "it would be nice if...":
* Use profile::openstack::main::statsd_host: 'labmon1001.eqiad.wmnet' as
authoritative for production services not running in cloud
* Use labs.yaml:statsd: labmon1001.eqiad.wmnet:8125
or labs.yaml:statsite::instance::graphite_host: 'labmon1001.eqiad.wmnet' as
authoritative when seen by cloud instances
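For illustration, a failover to labmon1002 on the population side would
then look roughly like the fragment below. This is a sketch under the
assumption that the keys above are the only ones being flipped; the
deployment-prep values and the varnish side discussed earlier are not
covered by it:

```yaml
# hieradata/common/profile/openstack/main.yaml
# Authoritative for production services not running in cloud.
profile::openstack::main::statsd_host: 'labmon1002.eqiad.wmnet'

# hieradata/labs.yaml
# Authoritative as seen by cloud instances. Note that statsd carries the
# port while graphite_host is a bare hostname.
statsd: labmon1002.eqiad.wmnet:8125
statsite::instance::graphite_host: 'labmon1002.eqiad.wmnet'
```

Both files need the change because production hosts and cloud instances
read different hiera trees.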
On Mon, Apr 2, 2018 at 4:45 AM, Arturo Borrero Gonzalez
<aborrero(a)wikimedia.org> wrote:
> Hi folks!
> I'm trying to setup labmon1002 as a cold standby for labmon1001.
> We need to sync the whisper files from one server to another, so in case
> we lost labmon1001 we don't lost all metrics.
> Regarding hiera, in my mind it was as simpler as having 2 hiera keys
> (names aren't set in stone):
> * wmcs::monitoring::server labmon1001.eqiad.wmnet
> * wmcs::monitoring::server_standby labmon1002.eqiad.wmnet
> And then:
> * have all clients send data to 'wmcs::monitoring::server'
> * In case of outage, simple flip the keys
> * the rsync cronjob is in server 'wmcs::monitoring::server_standby'
> If you grep the ops/puppet.git repo, you may find *a lot* of calls
> to 'labmon1001.eqiad.wmnet'. Examples:
> * hieradata/common/profile/openstack/labtest.yaml
> profile::openstack::labtest::statsd_host: 'labmon1001.eqiad.wmnet'
> * hieradata/common/profile/openstack/main.yaml
> profile::openstack::main::statsd_host: 'labmon1001.eqiad.wmnet'
> * hieradata/labs/deployment-prep/common.yaml
> service::configuration::statsd_host: labmon1001.eqiad.wmnet
> * hieradata/labs/deployment-prep/common.yaml
> graphite_host: labmon1001.eqiad.wmnet
> To improve a bit maintainability, I thought of using a single hiera key,
> the toplevel 'wmcs::monitoring::server', so in case of an outage, we
> don't have to update a lot of LOCs to point to the standby server.
> This is, some kind of code factorization.
> Hiera is a new thing to me, and I've been doing some testing, test
> compilations and playing with tools/hiera_lookup [0].
> And at the end, this doesn't seem to work because:
> * my new hiera keys are not found (why hieradata/labs.yaml is never read?)
> * some other weirdness unknown to me
> * isn't there a way to introduce a global hiera key for all our
> environment?
> So, would you please share some hints? What do you think about this
> whole picture? Do you have any suggestion for the hiera keys layout?
> Thanks in advance for your time! :-)
> Relevant phabricator tasks:
> * labmon1002 as cold standby for labmon1001
> ** https://phabricator.wikimedia.org/T189871
> * labmon: syncronize whisper files between labmon1001 and labmon1002
> ** https://phabricator.wikimedia.org/T190512
> [0] cmdline used are things like:
> % utils/hiera_lookup --fqdn=labmon1002.eqiad.wmnet
> --roles=labs::monitoring profile::labs::monitoring::master -v
> % utils/hiera_lookup --fqdn=labmon1002.eqiad.wmnet
> profile::labs::monitoring::master -v
--
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and
IRC