Tue:
* Quiet day on irc
* Meeting notes to wiki
* Updates to tech mgrs meeting
* Updates to SoS etherpad
Wed:
* Security bug reported by eddiegp
** Handed off to Andrew after some discussion
** https://phabricator.wikimedia.org/T191433
* labs-graphite io spike made nagf unresponsive
* declined wikimisc project for lack of community support
<https://phabricator.wikimedia.org/T191155>
* declined sau226test project as a laptop in the cloud
<https://phabricator.wikimedia.org/T190852>
* worked on some maintain-views requests/bugs
** https://phabricator.wikimedia.org/T191455
** https://phabricator.wikimedia.org/T191387
** https://phabricator.wikimedia.org/T191380
* Pinged on https://phabricator.wikimedia.org/T181679 to see if
cleanup can start
Thu:
* helped Jon Robson rescue a VM with a full disk
** This led to a Puppet patch for mediawiki-vagrant sudoers rules
Fri:
* Cleaned up reading-web-staging-3.reading-web-staging.eqiad.wmflabs
Puppet state. Follow on from Thursday's work.
* Ran maintain-views to purge old mediawikiwiki tables
<https://phabricator.wikimedia.org/T191387>
* Found and fixed another Vagrant sudoers rule bug
Sat & Sun:
* (stayed offline)
Mon:
* Tried to run `sudo maintain-views --clean --all-databases
--replace-all` on labsdb1009. Failed due to lock wait timeout in ...
some database.
* SRE meeting (see below for callouts)
* long session trying to help Moriel with a MediaWiki-Vagrant issue
SRE:
* ICU 57 rollout in progress (PHP7 blocker)
* All prod maintenance scripts to use "php" (HHVM on Trusty) starting today
* MW servers moving to Stretch during Q4
* DBAs getting quotes on additional sanitarium servers
* Jaime wants a meeting about <https://phabricator.wikimedia.org/T189542> (m5)
** Bryan and Andrew will meet with him this week to figure things out
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA
irc: bd808 v:415.839.6885 x6855
2018-04-04 20:00:02,806 INFO force is enabled
2018-04-04 20:00:02,864 INFO removing misc-project-backup
2018-04-04 20:00:02,982 INFO removing misc-project-backup
2018-04-04 20:00:03,856 INFO creating misc-project-backup at 2T
2018-04-04 20:00:04,784 INFO force is enabled
2018-04-04 20:00:04,828 INFO removing misc-snap
2018-04-04 20:00:04,888 INFO removing misc-snap
2018-04-04 20:00:05,244 INFO creating misc-snap at 1T
Sometime soon we need to upgrade our OpenStack deployment to the
next release, 'Mitaka'. I've done a test upgrade and the process was
fairly smooth, but there is at least one step that will cause
unavoidable downtime for new instance creation. Ideally this will only
take around 20 minutes, but given the number of surprises I ran into
just now it wouldn't shock me if it winds up taking several hours instead.
I propose to do this upgrade starting at the beginning of my day on
next Friday, April 13th. Unlucky number, but being a Friday there are
no active MediaWiki deployments so the lack of CI should be less
disruptive than usual. The next day is largely unscheduled for me, and
the following Monday is a WMF holiday so that gives us an entire
four-day block to back out any possible disasters before we're really
stepping on release engineering's toes.
The upgrade should not interfere with existing VMs. If there are
no objections, I'll send a public announcement about this tomorrow.
-Andrew
2018-04-03 20:00:02,895 INFO force is enabled
2018-04-03 20:00:02,943 INFO removing tools-project-backup
2018-04-03 20:00:03,001 INFO removing tools-project-backup
2018-04-03 20:00:03,447 INFO creating tools-project-backup at 2T
2018-04-03 20:00:04,310 INFO force is enabled
2018-04-03 20:00:04,347 INFO removing tools-snap
2018-04-03 20:00:04,396 INFO removing tools-snap
2018-04-03 20:00:06,032 INFO creating tools-snap at 1T
Hi folks!
I'm trying to set up labmon1002 as a cold standby for labmon1001.
We need to sync the whisper files from one server to the other, so that
if we lose labmon1001 we don't lose all the metrics.
Regarding hiera, in my mind it was as simple as having 2 hiera keys
(names aren't set in stone):
* wmcs::monitoring::server labmon1001.eqiad.wmnet
* wmcs::monitoring::server_standby labmon1002.eqiad.wmnet
And then:
* have all clients send data to 'wmcs::monitoring::server'
* In case of an outage, simply flip the keys
* the rsync cronjob lives on the 'wmcs::monitoring::server_standby' server
If you grep the ops/puppet.git repo, you will find *a lot* of hardcoded
references to 'labmon1001.eqiad.wmnet'. Examples:
* hieradata/common/profile/openstack/labtest.yaml
profile::openstack::labtest::statsd_host: 'labmon1001.eqiad.wmnet'
* hieradata/common/profile/openstack/main.yaml
profile::openstack::main::statsd_host: 'labmon1001.eqiad.wmnet'
* hieradata/labs/deployment-prep/common.yaml
service::configuration::statsd_host: labmon1001.eqiad.wmnet
* hieradata/labs/deployment-prep/common.yaml
graphite_host: labmon1001.eqiad.wmnet
To improve maintainability a bit, I thought of using a single hiera key,
the toplevel 'wmcs::monitoring::server', so that in case of an outage we
don't have to update a lot of lines of code to point to the standby server.
That is, a kind of code factoring.
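A minimal sketch of that single-key idea, assuming Hiera's value interpolation (the `%{hiera('...')}` lookup function) is usable in our setup and that the file holding the toplevel keys is actually part of the hierarchy for every affected host. Putting everything in one file is hypothetical and only for illustration; the toplevel key names are the proposed ones from above, not existing keys:

```yaml
# Hypothetical hiera data; file placement is an assumption for illustration.
# Proposed toplevel keys (names are not set in stone).
wmcs::monitoring::server: 'labmon1001.eqiad.wmnet'
wmcs::monitoring::server_standby: 'labmon1002.eqiad.wmnet'

# Existing keys would point at the toplevel key instead of
# hardcoding the hostname.
profile::openstack::main::statsd_host: "%{hiera('wmcs::monitoring::server')}"
profile::openstack::labtest::statsd_host: "%{hiera('wmcs::monitoring::server')}"
graphite_host: "%{hiera('wmcs::monitoring::server')}"
```

With a layout like this, a failover would mean swapping the two toplevel values in one place. The interpolated lookup only resolves if the toplevel key is visible in the hierarchy of the node doing the lookup, so this does not by itself fix keys that are not being found.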
Hiera is new to me, and I've been doing some testing: test
compilations and playing with utils/hiera_lookup [0].
In the end, this doesn't seem to work because:
* my new hiera keys are not found (why is hieradata/labs.yaml never read?)
* some other weirdness unknown to me
* isn't there a way to introduce a global hiera key for our entire environment?
So, could you please share some hints? What do you think about the
overall picture? Do you have any suggestions for the hiera key layout?
Thanks in advance for your time! :-)
Relevant phabricator tasks:
* labmon1002 as cold standby for labmon1001
** https://phabricator.wikimedia.org/T189871
* labmon: syncronize whisper files between labmon1001 and labmon1002
** https://phabricator.wikimedia.org/T190512
[0] the command lines I used look like:
% utils/hiera_lookup --fqdn=labmon1002.eqiad.wmnet
--roles=labs::monitoring profile::labs::monitoring::master -v
% utils/hiera_lookup --fqdn=labmon1002.eqiad.wmnet
profile::labs::monitoring::master -v