Hey all,
Clush allows targeting hosts listed in a file, via the --hostfile option.
Thanks to Riccardo's awesome work, we now have this feature available as a
Cumin backend on the Cloud VPS cumin masters (including labs-puppetmaster).
This is something we need when we have to target a list of hosts that are
not dynamically queryable, say hosts where puppet runs are failing, or
hosts that have NFS enabled. Sometimes these lists are too long to be fun
to pass as a comma-separated list of hostnames through Cumin's Direct
backend. Do use it with care, and make sure to check that your hostfile is
current and has the right hosts you are looking to target. If in doubt,
--dry-run :)
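A quick way to sanity-check a hostfile before targeting it (the file name
below is just an example) is to strip comments and blank lines,
de-duplicate, and eyeball the resulting host count; a minimal sketch:

    from pathlib import Path

    # read the hostfile, drop comments and blank lines, de-duplicate
    raw = Path('puppet_failures.txt').read_text().splitlines()
    hosts = sorted({line.strip() for line in raw
                    if line.strip() and not line.strip().startswith('#')})
    print('\n'.join(hosts))
    print('{} hosts to target'.format(len(hosts)))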
Docs on how to use are here -
https://wikitech.wikimedia.org/wiki/Cumin#HostFile_backend_(Enabled_only_in…
Best,
--
Madhumitha Viswanathan
Operations Engineer, Cloud Services
Traceback (most recent call last):
File "/usr/local/sbin/block_sync", line 125, in <module>
lock_file = open('/var/lock/{}_{}_backup.lock'.format(args.r_vg, args.r_lv, 'w+'))
FileNotFoundError: [Errno 2] No such file or directory: '/var/lock/misc_misc-project_backup.lock'
Traceback (most recent call last):
File "/usr/local/sbin/block_sync", line 125, in <module>
lock_file = open('/var/lock/{}_{}_backup.lock'.format(args.r_vg, args.r_lv, 'w+'))
FileNotFoundError: [Errno 2] No such file or directory: '/var/lock/tools_tools-project_backup.lock'
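For what it's worth, the tracebacks suggest the 'w+' mode string ended up
as a third argument to .format() instead of being passed to open(), so the
file is opened read-only and FileNotFoundError is raised whenever the lock
file does not already exist. A sketch of what was presumably intended
(args.r_vg and args.r_lv come from the script's argument parser):

    # open the per-volume lock file, creating it if it does not exist yet
    lock_file = open('/var/lock/{}_{}_backup.lock'.format(args.r_vg, args.r_lv), 'w+')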
Note that I will be traveling tomorrow and will miss our weekly meeting
(unless today's snowstorm continues and my flight is delayed).
Note also that Francisco has been quite on top of things in
#wikimedia-cloud and I consequently didn't end up doing much support there. \o/
2018-01-16:
* Lots of alerts, all of them caused by me rebooting labvirts
* A fair bit of chatter on #wikimedia-cloud, also a result of the reboots
2018-01-17:
* Rush job: Created some Trusty VMs for the Discovery team to run a demo
next week: https://phabricator.wikimedia.org/T185131
* Followed up on 'NCCIC Incident number INC000010157486' and created
https://phabricator.wikimedia.org/T185383
* More alerts, also self-inflicted due to meltdown reboots
2018-01-18:
* Granted a few tools membership requests
2018-01-19:
* Cleared out some logfiles on labtestnet2001 in response to an icinga
alert about disk space
2018-01-20:
* Responded to a bitninja email and they actually responded and
whitelisted us! (Or at least said they did)
2018-01-21:
* Granted a few more tools membership requests
2018-01-22:
* Responded to complaints about ssh access failing. This turned out to
be my fault, the result of a bad ldap patch.
* Deleted the 'gitblit' project which has been unused for some time
* Ops meeting
** Full etherpad at
https://office.wikimedia.org/wiki/Operations/Operations_Meeting_Notes/SRE-2…
** Meeting was sparsely-attended and brief
** Received a tip of the hat for finally upgrading everything to puppet 4
** Keith mentioned the puppet-apply vs. Trusty issue but everyone
sounded satisfied with the additional-backport fix
** From etherpad but not discussed:
*** [CLOUD/releng? heads up] Moving some wikis to s5? T184805. Cloud/releng
to provide some feedback.
*** Is tools-static going to be hosed up pretty soon? Not sure what the
solution here is.
---------- Forwarded message ----------
From: shinken <shinken@shinken-01.shinken.eqiad.wmflabs>
Date: Fri, Jan 19, 2018 at 10:26 AM
Subject: ** PROBLEM alert - tools-static-10/Free space - all mounts is
WARNING **
To: cpettet@wikimedia.org
Notification Type: PROBLEM
Service: Free space - all mounts
Host: tools-static-10
Address: 10.68.22.238
State: WARNING
Date/Time: Fri 19 Jan 16:26:22 UTC 2018
Notes URLs:
Additional Info:
WARNING: tools.tools-static-10.diskspace._srv.byte_percentfree (<100.00%)
--
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and
IRC
Traceback (most recent call last):
File "/usr/local/sbin/block_sync", line 125, in <module>
lock_file = open('/var/lock/{}_{}_backup.lock'.format(args.r_vg, args.r_lv, 'w+'))
FileNotFoundError: [Errno 2] No such file or directory: '/var/lock/misc_misc-project_backup.lock'
2018-01-16 20:00:02,777 INFO force is enabled
2018-01-16 20:00:02,806 INFO removing tools-project-backup
2018-01-16 20:00:02,847 INFO removing tools-project-backup
2018-01-16 20:00:03,503 INFO creating tools-project-backup at 2T
2018-01-16 20:00:04,253 INFO force is enabled
2018-01-16 20:00:04,270 INFO removing tools-snap
2018-01-16 20:00:04,321 INFO removing tools-snap
2018-01-16 20:00:05,859 INFO creating tools-snap at 1T
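For context, these log lines correspond to roughly the following sequence
(remove any stale logical volume, then recreate it at the configured size);
the volume group name and the exact steps are illustrative, not taken from
the actual script:

    import subprocess

    def recreate_lv(vg, lv, size):
        # remove the LV if it already exists (-f skips the confirmation
        # prompt); ignore the return code in case it is absent
        subprocess.call(['lvremove', '-f', '{}/{}'.format(vg, lv)])
        # recreate it at the requested size, e.g. '2T'
        subprocess.check_call(['lvcreate', '-L', size, '-n', lv, vg])

    recreate_lv('backup', 'tools-project-backup', '2T')
    recreate_lv('backup', 'tools-snap', '1T')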
Hello Cloud Admins!
As part of https://phabricator.wikimedia.org/T174569 we have to alter some
big tables.
One of them is the logging table, which, for instance, on wikidata takes
around 8h to alter; that is the shard I am currently working on.
Because of the nature of the change (some columns being added) and the
ROW-based replication we use on the sanitariums, this change needs to be
done through replication (from the sanitarium, or its masters, to the labs
servers). This will obviously generate lag; if it were not done that way,
it would break replication until the column is added on the labs hosts,
which is less desirable than replication lag.
I am planning to run the alter probably tomorrow or Monday (I will notify
you when I start it) on the sanitarium host in s5. That means there will be
lag on the labs servers, for a few hours, on the s5 instance. This will
also affect s1 and s3, because we use the same replication thread for those
shards too - a FIXME we have pending.
s2, s4, s6 and s7 will remain unaffected as they have their own replication
thread.
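If you want to keep an eye on the lag while the alter runs, something
along these lines works against a MariaDB replica, assuming you have an
account with the REPLICATION CLIENT privilege (the hostname and credentials
below are placeholders):

    import pymysql

    conn = pymysql.connect(host='replica.example.org', user='watcher',
                           password='secret',
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            # on a multi-source MariaDB replica this returns one row per
            # replication thread (s1/s3/s5 share a single thread here)
            cur.execute('SHOW ALL SLAVES STATUS')
            for row in cur.fetchall():
                print(row['Connection_name'], row['Seconds_Behind_Master'])
    finally:
        conn.close()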
Should you have any questions, let me know!
Thanks
Manuel.