There is an entire hour's worth of talk about hiera, I imagine, and I talked a bit with Brooke about similar things last week.  Let's do that maybe Wed or Thu (talk about hiera end-to-end).  Briefly though:

* I don't mind swapping a key in hiera for failover.  A service URL would probably be more sane for actual clients, but we can do that after all this regardless.  Doing carbon relay duplication seems fine to me too (https://phabricator.wikimedia.org/T190512#4090428) but is maybe overcomplicated at the moment.  Whatever you think :)

* Let's not use these key paths:
> wmcs::monitoring::server labmon1001.eqiad.wmnet
> wmcs::monitoring::server_standby labmon1002.eqiad.wmnet

Instead, for main/eqiad0 and everything that is actually affected here, let's use this key path as authoritative for now for the few production things that read this value:

> common/profile/openstack/main.yaml:profile::openstack::main::statsd_host: 'labmon1001.eqiad.wmnet'

And these key values for cloud tenants (this is where it's really meaningful):

> labs.yaml:statsd: labmon1001.eqiad.wmnet:8125
> labs.yaml:statsite::instance::graphite_host: 'labmon1001.eqiad.wmnet'

Unfortunately, I don't know why these were duplicated originally, so I'm not sure how deep the cleanup would go, but for now let's just keep both.

Breaking down existing instances of the value labmon1001.eqiad.wmnet in hiera:

> common/profile/openstack/main.yaml:profile::openstack::main::statsd_host: 'labmon1001.eqiad.wmnet'
> common/profile/openstack/base.yaml:profile::openstack::base::monitoring_host: 'labmon1001.eqiad.wmnet'
> common/profile/openstack/labtest.yaml:profile::openstack::labtest::statsd_host: 'labmon1001.eqiad.wmnet'

Yes, this is a per-deployment value atm even though we don't really have a per-deployment graphite instance.  I'm not too worried about this duplication, as profile::openstack::main::statsd_host should be the only key that is actually used; the others are just filling a dummy role.  We would fold these into base, but that's probably not in scope atm.

> labs/deployment-prep/common.yaml:service::configuration::statsd_host: labmon1001.eqiad.wmnet
> labs/deployment-prep/common.yaml:graphite_host: labmon1001.eqiad.wmnet
> labs/deployment-prep/common.yaml:statsd: labmon1001.eqiad.wmnet:8125
> labs/deployment-prep/common.yaml:role::logstash::collector::statsd_host: labmon1001.eqiad.wmnet

Deployment-prep specific values.  Let's not worry about these for now.  A long list of hiera and puppet cleanup is necessary, and I don't know why deployment-prep was ever set specifically, so let's let them worry about it for now.

> labs.yaml:statsd: labmon1001.eqiad.wmnet:8125
> labs.yaml:statsite::instance::graphite_host: 'labmon1001.eqiad.wmnet'

These are the actual values pulled down by cloud instances, and I'm fine w/ them being duplicated for the moment.  But no need to add another value either :)

> role/common/labs/puppetmaster.yaml:labspuppetbackend::statsd_host: "labmon1001.eqiad.wmnet"

This should really be pulling from the deployment-specific value, but that probably requires some refactoring.

> role/common/cache/misc.yaml:      eqiad: 'labmon1001.eqiad.wmnet'

labmon1001 sits behind varnish iirc and this is set up for that.  I don't think this area of things is really hiera-ized, so let's just leave this alone for now.  Suffice it to say there are two sides to the failover from labmon1001 to labmon1002: the population side (where a changed hiera key will be the deal here for now) and the consumer side (where varnish knows where to send requests for https://graphite-labs.wikimedia.org/).  That failover is less time-critical and would be changes here iiuc.

> role/common/labsencapi.yaml:profile::puppetmaster::labsencapi::statsd_host: "labmon1001.eqiad.wmnet"

This is ideally a deployment-specific value, and for the most part it feeds off of profile::openstack::main::statsd_host: 'labmon1001.eqiad.wmnet' for now.

---

And at the end, this doesn't seem to work because:
* my new hiera keys are not found (why hieradata/labs.yaml is never read?)

hieradata/labs.yaml is never read by any production host.  There are two hiera trees in use, and they also use different lookup logic, so while there appears to be key path overlap between production hosts and cloud instances, there really is not (other than some not-good scenarios that we won't discuss here).  I.e. you cannot have one value to rule them all, because cloud instances and production hosts do not read the same config, and even when they do, they do not read it in the same way :)  So plan on 2 values to rule them all: one per deployment and one in labs.yaml for instances (ok, that's 3 at least, unf).

* some other weirdness unknown to me

So much :)

* isn't there a way to introduce a global hiera key for all our environment?

No, we don't want to do this in theory, depending on what you mean by 'our environment'.  Count on cloud instances seeing hiera differently from production hosts, and that being OK.  If we wanted a value that was widely used for our deployment, we would use common/ and base, and we'd still put it under a profile:: path.

--

Skipping over a lot of "it would be nice if...":

* Use profile::openstack::main::statsd_host: 'labmon1001.eqiad.wmnet' as authoritative for production services not running in cloud
* Use labs.yaml:statsd: labmon1001.eqiad.wmnet:8125 or labs.yaml:statsite::instance::graphite_host: 'labmon1001.eqiad.wmnet' as authoritative for cloud instances
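
Roughly, a failover flip under that scheme might look like the fragment below.  This is only a sketch: it assumes failover is a plain value swap in these keys, and that labmon1002 accepts statsd traffic on the same port (8125) as labmon1001, neither of which I have verified.  Two separate files are shown in one block, split by the YAML document separator:

```yaml
# hieradata/common/profile/openstack/main.yaml
# Read by production hosts only (never by cloud instances).
profile::openstack::main::statsd_host: 'labmon1002.eqiad.wmnet'
---
# hieradata/labs.yaml
# Read by cloud instances only; both duplicate keys are kept for now.
# Port 8125 on the standby is an assumption carried over from labmon1001.
statsd: labmon1002.eqiad.wmnet:8125
statsite::instance::graphite_host: 'labmon1002.eqiad.wmnet'
```

The remaining occurrences of labmon1001.eqiad.wmnet listed above would each need their own decision and are not covered by this sketch.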

On Mon, Apr 2, 2018 at 4:45 AM, Arturo Borrero Gonzalez <aborrero@wikimedia.org> wrote:
Hi folks!

I'm trying to set up labmon1002 as a cold standby for labmon1001.
We need to sync the whisper files from one server to the other, so that in
case we lose labmon1001 we don't lose all metrics.

Regarding hiera, in my mind it was as simple as having 2 hiera keys
(names aren't set in stone):

* wmcs::monitoring::server labmon1001.eqiad.wmnet
* wmcs::monitoring::server_standby labmon1002.eqiad.wmnet

And then:

* have all clients send data to 'wmcs::monitoring::server'
* In case of outage, simply flip the keys
* the rsync cronjob is in server 'wmcs::monitoring::server_standby'

If you grep the ops/puppet.git repo, you may find *a lot* of calls
to 'labmon1001.eqiad.wmnet'. Examples:

* hieradata/common/profile/openstack/labtest.yaml
profile::openstack::labtest::statsd_host: 'labmon1001.eqiad.wmnet'

* hieradata/common/profile/openstack/main.yaml
profile::openstack::main::statsd_host: 'labmon1001.eqiad.wmnet'

* hieradata/labs/deployment-prep/common.yaml
service::configuration::statsd_host: labmon1001.eqiad.wmnet

* hieradata/labs/deployment-prep/common.yaml
graphite_host: labmon1001.eqiad.wmnet

To improve maintainability a bit, I thought of using a single hiera key,
the toplevel 'wmcs::monitoring::server', so in case of an outage, we
don't have to update a lot of LOCs to point to the standby server.
That is, some kind of code factorization.

Hiera is a new thing to me, and I've been doing some testing, test
compilations and playing with utils/hiera_lookup [0].
And at the end, this doesn't seem to work because:
* my new hiera keys are not found (why hieradata/labs.yaml is never read?)
* some other weirdness unknown to me
* isn't there a way to introduce a global hiera key for all our environment?

So, would you please share some hints? What do you think about this
whole picture? Do you have any suggestion for the hiera keys layout?

Thanks in advance for your time! :-)


Relevant phabricator tasks:
 * labmon1002 as cold standby for labmon1001
 ** https://phabricator.wikimedia.org/T189871
 * labmon: syncronize whisper files between labmon1001 and labmon1002
 ** https://phabricator.wikimedia.org/T190512

[0] cmdline used are things like:

% utils/hiera_lookup --fqdn=labmon1002.eqiad.wmnet --roles=labs::monitoring profile::labs::monitoring::master -v
% utils/hiera_lookup --fqdn=labmon1002.eqiad.wmnet profile::labs::monitoring::master -v



--
Chase Pettet
chasemp on phabricator and IRC