I opened https://phabricator.wikimedia.org/T192422 and depooled
labvirt1015 for now. I don't know that this is actually cause for
alarm, but 97 VMs seems like a lot of eggs to have in one basket.
-A
-------- Forwarded Message --------
Subject: ** PROBLEM alert - labvirt1015/ensure kvm processes are
running is CRITICAL **
Date: Wed, 18 Apr 2018 01:17:17 +0000
From: icinga@einsteinium.wikimedia.org
To: abogott@wikimedia.org
Notification Type: PROBLEM
Service: ensure kvm processes are running
Host: labvirt1015
Address: 10.64.20.31
State: CRITICAL
Date/Time: Wed Apr 18 01:17:17 UTC 2018
Notes URLs:
Additional Info:
PROCS CRITICAL: 97 processes with regex args /usr/bin/kvm
I'm around now but I'm trying to handle our 2-year-old so my wife can get some
sleep. Our 6-year-old was up all night with a stomach that couldn't hold
anything down. I will spare everyone the details but it's pretty brutal.
--
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and
IRC
2018-04-11 20:00:02,533 INFO force is enabled
2018-04-11 20:00:02,572 INFO removing misc-project-backup
2018-04-11 20:00:02,654 INFO removing misc-project-backup
2018-04-11 20:00:03,144 INFO creating misc-project-backup at 2T
2018-04-11 20:00:04,043 INFO force is enabled
2018-04-11 20:00:04,107 INFO removing misc-snap
2018-04-11 20:00:04,155 INFO removing misc-snap
2018-04-11 20:00:04,428 INFO creating misc-snap at 1T
* Bryan to ping Eliza about usage of PagerDuty by OIT to see if there
is a way we could trial it
* Rotating lead for weekly meeting: come up with a plan and do it
* Chase & Brooke to work on Puppet state of the union doc and next
steps ideas to bring back to group
* Sarah to talk with Arturo and Brooke about onboarding issues for doc
improvements
* James to give this page a section on the main page
<https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences>
Things we didn't have time to talk about:
* Do we need to make more distinction between site-maintaining people
and others? ++
* Planning :) How the heck are we going to do all the things? Gotta
get ruthless in casting things off we can't do. -- yes but not now bc
tired :) +
* Question I have from time to time: am I working enough, performance-wise? +
* Excellent effort at making the team feel like a team of equals
despite realities of contractor status+
Any/all of these could be topics for future team meetings. We could do
a meeting or two where we talk about topics such as these instead of
project updates, and just read the update notes offline.
Bryan
--
Bryan Davis Wikimedia Foundation <bd808@wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA
irc: bd808 v:415.839.6885 x6855
2018-04-10 20:00:03,216 INFO force is enabled
2018-04-10 20:00:03,244 INFO removing tools-project-backup
2018-04-10 20:00:03,341 INFO removing tools-project-backup
2018-04-10 20:00:03,850 INFO creating tools-project-backup at 2T
2018-04-10 20:00:04,611 INFO force is enabled
2018-04-10 20:00:04,641 INFO removing tools-snap
2018-04-10 20:00:04,689 INFO removing tools-snap
2018-04-10 20:00:05,849 INFO creating tools-snap at 1T
Tue:
* Quiet day on irc
* Meeting notes to wiki
* Updates to tech mgrs meeting
* Updates to SoS etherpad
Wed:
* Security bug reported by eddiegp
** Handed off to Andrew after some discussion
** https://phabricator.wikimedia.org/T191433
* labs-graphite I/O spike made nagf unresponsive
* declined wikimisc project for lack of community support
<https://phabricator.wikimedia.org/T191155>
* declined sau226test project as a laptop in the cloud
<https://phabricator.wikimedia.org/T190852>
* worked on some maintain-views requests/bugs
** https://phabricator.wikimedia.org/T191455
** https://phabricator.wikimedia.org/T191387
** https://phabricator.wikimedia.org/T191380
* Pinged on https://phabricator.wikimedia.org/T181679 to see if
cleanup can start
Thu:
* helped Jon Robson rescue a VM with a full disk
** This led to a Puppet patch for mediawiki-vagrant sudoers rules
Fri:
* Cleaned up reading-web-staging-3.reading-web-staging.eqiad.wmflabs
Puppet state. Follow-up from Thursday's work.
* Ran maintain-views to purge old mediawikiwiki tables
<https://phabricator.wikimedia.org/T191387>
* Found and fixed another Vagrant sudoers rule bug
Sat & Sun:
* (stayed offline)
Mon:
* Tried to run `sudo maintain-views --clean --all-databases
--replace-all` on labsdb1009. Failed due to lock wait timeout in ...
some database.
* SRE meeting (see below for callouts)
* long session trying to help Moriel with a MediaWiki-Vagrant issue
SRE:
* ICU 57 rollout in progress (PHP7 blocker)
* All prod maintenance scripts to use "php" (HHVM on Trusty) starting today
* MW servers moving to Stretch during Q4
* DBAs getting quotes on additional sanitarium servers
* Jaime wants a meeting about <https://phabricator.wikimedia.org/T189542> (m5)
** Bryan and Andrew will meet with him this week to figure things out
Bryan
--
Bryan Davis Wikimedia Foundation <bd808@wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA
irc: bd808 v:415.839.6885 x6855
2018-04-04 20:00:02,806 INFO force is enabled
2018-04-04 20:00:02,864 INFO removing misc-project-backup
2018-04-04 20:00:02,982 INFO removing misc-project-backup
2018-04-04 20:00:03,856 INFO creating misc-project-backup at 2T
2018-04-04 20:00:04,784 INFO force is enabled
2018-04-04 20:00:04,828 INFO removing misc-snap
2018-04-04 20:00:04,888 INFO removing misc-snap
2018-04-04 20:00:05,244 INFO creating misc-snap at 1T
Sometime soon we need to upgrade our OpenStack deployment to the
next release, 'Mitaka'. I've done a test upgrade and the process was
fairly smooth, but there is at least one step that will cause
unavoidable downtime for new instance creation. Ideally this will only
take around 20 minutes, but given the number of surprises I ran into
just now it wouldn't shock me if it winds up taking several hours instead.
I propose to do this upgrade starting at the beginning of my day on
next Friday, April 13th. An unlucky number, but since it's a Friday there
are no active MediaWiki deployments, so the lack of CI should be less
disruptive than usual. The next day is largely unscheduled for me, and
the following Monday is a WMF holiday so that gives us an entire
four-day block to back out any possible disasters before we're really
stepping on release engineering's toes.
The upgrade should not interfere with existing VMs. If there are
no objections, I'll send a public announcement about this tomorrow.
-Andrew
2018-04-03 20:00:02,895 INFO force is enabled
2018-04-03 20:00:02,943 INFO removing tools-project-backup
2018-04-03 20:00:03,001 INFO removing tools-project-backup
2018-04-03 20:00:03,447 INFO creating tools-project-backup at 2T
2018-04-03 20:00:04,310 INFO force is enabled
2018-04-03 20:00:04,347 INFO removing tools-snap
2018-04-03 20:00:04,396 INFO removing tools-snap
2018-04-03 20:00:06,032 INFO creating tools-snap at 1T
Hi folks!
I'm trying to set up labmon1002 as a cold standby for labmon1001.
We need to sync the whisper files from one server to the other, so that
in case we lose labmon1001 we don't lose all the metrics.
Regarding hiera, in my mind it was as simple as having 2 hiera keys
(names aren't set in stone):
* wmcs::monitoring::server labmon1001.eqiad.wmnet
* wmcs::monitoring::server_standby labmon1002.eqiad.wmnet
And then:
* have all clients send data to 'wmcs::monitoring::server'
* In case of an outage, simply flip the keys (see the sketch just below)
* the rsync cronjob runs on the 'wmcs::monitoring::server_standby' host
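To make that concrete, a minimal hieradata sketch (the file location and
the exact key names are just an assumption on my part, nothing final):

  # e.g. hieradata/common.yaml (or wherever we decide these keys live)
  wmcs::monitoring::server: 'labmon1001.eqiad.wmnet'
  wmcs::monitoring::server_standby: 'labmon1002.eqiad.wmnet'

Failing over would then mostly mean swapping those two values in one place.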
If you grep the ops/puppet.git repo, you will find *a lot* of references
to 'labmon1001.eqiad.wmnet'. Examples:
* hieradata/common/profile/openstack/labtest.yaml
profile::openstack::labtest::statsd_host: 'labmon1001.eqiad.wmnet'
* hieradata/common/profile/openstack/main.yaml
profile::openstack::main::statsd_host: 'labmon1001.eqiad.wmnet'
* hieradata/labs/deployment-prep/common.yaml
service::configuration::statsd_host: labmon1001.eqiad.wmnet
* hieradata/labs/deployment-prep/common.yaml
graphite_host: labmon1001.eqiad.wmnet
To improve maintainability a bit, I thought of using a single hiera key,
the toplevel 'wmcs::monitoring::server', so that in case of an outage we
don't have to update a lot of lines to point to the standby server.
This is, in a way, a kind of code factorization.
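To illustrate the idea (an untested sketch; I'm assuming our hiera setup
supports the alias()/hiera() interpolation functions, and the file
placement is only a guess), the existing per-profile keys would simply
point at the single toplevel key:

  # hieradata/common.yaml -- single source of truth
  wmcs::monitoring::server: 'labmon1001.eqiad.wmnet'

  # hieradata/common/profile/openstack/main.yaml
  profile::openstack::main::statsd_host: "%{alias('wmcs::monitoring::server')}"

  # hieradata/labs/deployment-prep/common.yaml
  graphite_host: "%{alias('wmcs::monitoring::server')}"

With something like this, an outage would only require changing the single
toplevel value to point at labmon1002.eqiad.wmnet.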
Hiera is a new thing to me, and I've been doing some testing, test
compilations and playing with utils/hiera_lookup [0].
In the end, this doesn't seem to work because:
* my new hiera keys are not found (why is hieradata/labs.yaml never read?)
* some other weirdness unknown to me
* isn't there a way to introduce a global hiera key for our whole environment?
So, would you please share some hints? What do you think about this
whole picture? Do you have any suggestions for the hiera key layout?
Thanks in advance for your time! :-)
Relevant phabricator tasks:
* labmon1002 as cold standby for labmon1001
** https://phabricator.wikimedia.org/T189871
* labmon: synchronize whisper files between labmon1001 and labmon1002
** https://phabricator.wikimedia.org/T190512
[0] The command lines used are things like:
% utils/hiera_lookup --fqdn=labmon1002.eqiad.wmnet
--roles=labs::monitoring profile::labs::monitoring::master -v
% utils/hiera_lookup --fqdn=labmon1002.eqiad.wmnet
profile::labs::monitoring::master -v