Hello,
Kubernetes released information about two vulnerabilities that allow
direct host manipulation. I've already patched the production and
staging clusters. Tools, however, is running an old version of
Kubernetes and is possibly vulnerable. I've filed
https://phabricator.wikimedia.org/T189680 to track this.
--
Alexandros Kosiaris <akosiaris(a)wikimedia.org>
2018-03-13 20:00:02,404 INFO force is enabled
2018-03-13 20:00:02,446 INFO removing tools-project-backup
2018-03-13 20:00:02,504 INFO removing tools-project-backup
2018-03-13 20:00:03,004 INFO creating tools-project-backup at 2T
2018-03-13 20:00:03,779 INFO force is enabled
2018-03-13 20:00:03,806 INFO removing tools-snap
2018-03-13 20:00:03,883 INFO removing tools-snap
2018-03-13 20:00:05,262 INFO creating tools-snap at 1T
Hey folks,
From the ops meeting yesterday:
* Pinged Rob about labvirt1019|20 - his update is at
https://phabricator.wikimedia.org/T187373#4043737
* The meeting was cut off at 30 minutes and forked into Q4 goal discussion
- see https://etherpad.wikimedia.org/p/SRE-goals-FQ4-FY1718
** DBA proposed goal "Sanitarium multisource migration to multi-instance" -
this will probably cause some downtime/lag for the wiki replicas, but no
work is required from us.
Clinic Duty update
* 1:1 with arturo on apt package upgrades
* Approved tools membership requests
* Helped prtxsna (wmf design) with taking over the design vps project and
basic vps questions
* Tools ldap config incident https://phabricator.wikimedia.org/T187373
* Responded on https://phabricator.wikimedia.org/T159930 (custom instance
flavor for Dumps project)
* Shout out to Chico and Yifei for their awesome support for the WCDO
project! https://phabricator.wikimedia.org/T189165
* Helped with /data/scratch not accessible issue in tools on Monday -
https://phabricator.wikimedia.org/T189018#4044066
Best,
--
Madhumitha Viswanathan
Operations Engineer, Cloud Services
Andrew, this was a great post. I already knew much of this and I enjoyed
it immensely.
On Mar 9, 2018 5:42 PM, "Andrew" <no-reply(a)phabricator.wikimedia.org> wrote:
Andrew published this post.
I've spent the last few months building new web servers to support some of
the basic WMCS web services: Wikitech, Horizon, and Toolsadmin. The new
Wikitech service is already up and running; on Wednesday I hope to flip the
last switch and move all public Horizon and Toolsadmin traffic to the new
servers as well.
If everything goes as planned, users will barely notice this change at all.
This is a lot of what our team does -- running as fast as we can just to
stay in place. Software doesn't last forever -- it takes a lot of effort
just to hold things together. Here are some of the problems that this
rebuild is solving:
- T186288 <https://phabricator.wikimedia.org/T186288>: *Operating System
obsolescence*. Years ago, the Wikimedia Foundation Operations team
resolved to move all of our infrastructure from Ubuntu to Debian Linux.
Ubuntu Trusty will stop receiving security upgrades in about a year, so we
have to stop using it by then. All three services (Wikitech, Horizon,
Toolsadmin) were running on Ubuntu servers; Wikitech was the last of the
Foundation's MediaWiki hosts to run on Ubuntu, so its upgrade should allow
for all kinds of special cases to be ignored in the future.
- T98813 <https://phabricator.wikimedia.org/T98813>: *Keeping up with
PHP and HHVM*. In addition to being the last wiki on Trusty, Wikitech
was also the last wiki on PHP 5. Every other wiki is using HHVM and, with
the death of the old Wikitech, we can finally stop supporting PHP 5
internally. Better yet, this plays a part in unblocking the entire
MediaWiki ecosystem (T172165 <https://phabricator.wikimedia.org/T172165>)
as newer versions of MediaWiki standardize on HHVM or PHP 7.
- T168559 <https://phabricator.wikimedia.org/T168559>: *Escaping failing
hardware*. The old Wikitech site was hosted on a machine named 'Silver'.
Hardware wears out, and Silver is pretty old. The last few times I've
rebooted it, it's required a bit of nudging to bring it back up. If it
powered down today, it would probably come back, but it might not. As of
today's switchover, that scenario won't result in weeks of Wikitech
downtime.
- T169099 <https://phabricator.wikimedia.org/T169099>: *Tracking
OpenStack upgrades*. OpenStack (the software project that includes
Horizon and most of our virtual machine infrastructure) releases a new
version every six months. Ubuntu packages up every version with all of its
dependencies, and provides a clear upgrade path between versions. Debian,
for the most part, does not. The new release of Horizon is no longer
deployed through an upstream package at all, but instead is a pure Python
deploy starting with the raw Horizon source and requirements list, rolled
into Wheels and deployed into an isolated virtual environment. It's unclear
exactly how we'll transition our other OpenStack components away from
Ubuntu, but this Horizon deploy provides a potential model for deploying
any OpenStack project, any version, on any OS. Having done this I'm much
less worried about our reliance on often-fickle upstream packagers.
- T187506 <https://phabricator.wikimedia.org/T187506>: *High
availability*. The old versions of these web services were hosted on
single servers. Any maintenance or hardware downtime meant that the
websites were gone for the duration. Now we have a pair of servers with a
shared cache, behind a load-balancer. If either of the servers dies (or,
more likely, we need to reboot one for kernel updates) the website will
remain up and responsive.
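For context, a wheels-into-virtualenv deploy like the Horizon one described
above can be sketched roughly as follows; the paths, package name, and
requirements file here are illustrative, not the actual puppetized deploy:

```shell
# Rough sketch of a pure-Python deploy (illustrative paths, not the
# real WMCS deployment code). Build wheels from the upstream source
# and its requirements list, then install them into an isolated
# virtual environment, never touching distro packages.
python3 -m venv /srv/deploy/horizon/venv
/srv/deploy/horizon/venv/bin/pip wheel \
    --wheel-dir /srv/deploy/horizon/wheels \
    -r requirements.txt horizon
/srv/deploy/horizon/venv/bin/pip install \
    --no-index --find-links /srv/deploy/horizon/wheels horizon
```

Because the install step uses --no-index, the target host consumes only the
pre-built wheel directory; the same recipe should in principle work for any
OpenStack component, any version, on any OS with a Python toolchain.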
Of course, now that we've just moved Wikitech to HHVM, the main Wikimedia
cluster is being upgraded from HHVM to PHP 7, and Wikitech will soon follow
suit.
The websites look the same, but the race never ends.
*POST DETAIL*
https://phabricator.wikimedia.org/phame/post/view/87/running_red-queen-style/
2018-03-07 21:00:02,824 INFO force is enabled
2018-03-07 21:00:02,880 INFO removing misc-project-backup
2018-03-07 21:00:02,980 INFO removing misc-project-backup
2018-03-07 21:00:03,827 INFO creating misc-project-backup at 2T
2018-03-07 21:00:04,690 INFO force is enabled
2018-03-07 21:00:04,735 INFO removing misc-snap
2018-03-07 21:00:04,778 INFO removing misc-snap
2018-03-07 21:00:05,273 INFO creating misc-snap at 1T
TARGET="/srv/backup/misc" SOURCE="/dev/mapper/backup-misc--project" FSTYPE="ext4" OPTIONS="rw,relatime,data=ordered"
2018-03-07 20:00:01,283 ERROR Local device is mounted. Operations may be unsafe
Hi,
I'm not sure if the outage from yesterday qualifies for an official
incident report. Let me know. Anyway, I will write here a report of what
happened for future reference.
Timeline:
* 2018-03-06 12:58Z arturo doing package upgrades in toolforge (jessie
machines) with clush [0]. All operations are logged in SAL [1].
* 2018-03-06 13:21Z some upgrades failed because of a debconf prompt. The
debconf prompt appeared because DEBIAN_FRONTEND=noninteractive was not
used. This resulted in stalled dpkg operations. There were also clashes
with puppet apt operations
* 2018-03-06 13:21Z arturo killed stalled dpkg procs in toolforge and
reconfigured the affected packages. Two important packages were affected:
libnss-ldap and sudo-ldap
* 2018-03-06 13:32Z users report some tools in toolforge misbehaving via
IRC and phabricator [2][3][4]; we start investigating (arturo and chico)
* 2018-03-06 13:38Z first investigations are directed towards DB issues,
so the DBA team is contacted. They confirm all is working fine on their
side.
* 2018-03-06 14:07Z chase arrives on the scene and starts investigating.
NFS client issues are detected: users can't read files in their home
directories in toolforge.
* 2018-03-06 14:23Z chasemp downtimes icinga alert for k8s workers
* 2018-03-06 15:13Z Andrew, Madhu, Bryan and Brook come to lend a hand.
* 2018-03-06 15:21Z the tracking task is created in phabricator [5]. By
this time, it is more than clear that the issue is related to the earlier
package upgrades.
* 2018-03-06 15:27Z some toolforge servers are rebooted. It is suggested
we start rebuilding part of the cluster.
* 2018-03-06 15:57Z Madhu reports that an nscd restart + nscd cache flush
+ machine reboot + puppet run can get servers back into a good state.
* 2018-03-06 16:21Z All systems are back to normal state.
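The recovery recipe from 15:57Z can be sketched as a per-node sequence;
command names are as on Debian jessie, and this is an after-the-fact sketch
rather than the exact commands that were run:

```shell
# Per-node recovery sketch (jessie-era commands, illustrative only):
# flush nscd's caches so stale LDAP lookups are dropped, restart the
# daemon, re-apply configuration with puppet, then reboot the node.
nscd --invalidate=passwd
nscd --invalidate=group
service nscd restart
puppet agent --test
reboot
```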
Notes:
* Package upgrade operations were carried out after several prior tests
on a list of canary servers [6].
* The need for DEBIAN_FRONTEND=noninteractive was already known, but a
human error occurred (arturo forgot to use it when doing the upgrades)
* Package pinning for the nss/ldap/pam packages is in place, but it is
not enough. We need a *complete* freeze of these packages.
* This shows that our upgrade workflow [7] is not ready for wide usage
and needs more development.
Conclusions:
* extend apt pinning for more nss/ldap/pam packages
* study implementing apt holds for those packages via puppet?
* embed DEBIAN_FRONTEND=noninteractive into apt-upgrade script
* better integration of apt-upgrade with other apt operations (especially
puppet). Perhaps we could automatically disable puppet from within the
apt-upgrade script while operations are in progress
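The *complete* freeze proposed above could, for example, take the shape of
an apt preferences fragment; the file path is illustrative, and the package
list here covers only the two packages named in the report (a negative pin
priority stops apt from ever selecting a candidate version):

```
# /etc/apt/preferences.d/ldap-freeze  (illustrative path)
Package: libnss-ldap
Pin: release *
Pin-Priority: -1

Package: sudo-ldap
Pin: release *
Pin-Priority: -1
```

Alternatively, `apt-mark hold libnss-ldap sudo-ldap` achieves a similar
freeze and might be easier to manage from puppet.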
Data:
* Affected tools in toolforge: at least 3 [2][3][4]
* Amount of downtime: 3h (13:21Z --> 16:21Z)
Related links:
[0] toolforge: package upgrades as part of the new workflow
https://phabricator.wikimedia.org/T188994
[1] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[2] Global User Contributions complains about replica conf file
https://phabricator.wikimedia.org/T189001
[3] Connection Error at OrphanTalk Tools
https://phabricator.wikimedia.org/T188998
[4] Tool https://tools.wmflabs.org/replag/ is reported via IRC
[5] Toolforge instances (maybe only Jessie?) are having issues with
NFS/LDAP https://phabricator.wikimedia.org/T189018
[6] https://etherpad.wikimedia.org/p/toolforge-upgrades
[7] create 'attended' upgrade workflow for cloud with Toolforge as
canonical case https://phabricator.wikimedia.org/T181647
2018-03-06 20:00:02,527 INFO force is enabled
2018-03-06 20:00:02,577 INFO removing tools-project-backup
2018-03-06 20:00:02,677 INFO removing tools-project-backup
2018-03-06 20:00:03,109 INFO creating tools-project-backup at 2T
2018-03-06 20:00:03,894 INFO force is enabled
2018-03-06 20:00:03,928 INFO removing tools-snap
2018-03-06 20:00:03,969 INFO removing tools-snap
2018-03-06 20:00:05,199 INFO creating tools-snap at 1T