Hello,
Kubernetes released information about two vulnerabilities that allow
direct host manipulation. I've already patched the production and
staging clusters. Tools, however, is running an old version of
Kubernetes and is possibly vulnerable. I've filed
https://phabricator.wikimedia.org/T189680 to track this.
--
Alexandros Kosiaris <akosiaris(a)wikimedia.org>
2018-03-13 20:00:02,404 INFO force is enabled
2018-03-13 20:00:02,446 INFO removing tools-project-backup
2018-03-13 20:00:02,504 INFO removing tools-project-backup
2018-03-13 20:00:03,004 INFO creating tools-project-backup at 2T
2018-03-13 20:00:03,779 INFO force is enabled
2018-03-13 20:00:03,806 INFO removing tools-snap
2018-03-13 20:00:03,883 INFO removing tools-snap
2018-03-13 20:00:05,262 INFO creating tools-snap at 1T
Hey folks,
From the ops meeting yesterday:
* Pinged Rob about labvirt1019|20 - his update is at
https://phabricator.wikimedia.org/T187373#4043737
* The meeting was cut off at 30 minutes and forked into Q4 goal discussion
- see https://etherpad.wikimedia.org/p/SRE-goals-FQ4-FY1718
** DBA proposed goal "Sanitarium multisource migration to multi-instance" -
this will probably cause some downtime/lag for the wiki replicas, but no
work is required from us.
Clinic Duty update
* 1:1 with arturo on apt package upgrades
* Approved tools membership requests
* Helped prtxsna (wmf design) with taking over the design vps project and
basic vps questions
* Tools ldap config incident https://phabricator.wikimedia.org/T187373
* Responded on https://phabricator.wikimedia.org/T159930 (custom instance
flavor for Dumps project)
* Shout out to Chico and Yifei for their awesome support for the WCDO
project! https://phabricator.wikimedia.org/T189165
* Helped with /data/scratch not accessible issue in tools on Monday -
https://phabricator.wikimedia.org/T189018#4044066
Best,
--
Madhumitha Viswanathan
Operations Engineer, Cloud Services
Andrew, this was a great post. I already knew much of this and I enjoyed
it immensely.
On Mar 9, 2018 5:42 PM, "Andrew" <no-reply(a)phabricator.wikimedia.org> wrote:
Andrew published this post.
I've spent the last few months building new web servers to support some of
the basic WMCS web services: Wikitech, Horizon, and Toolsadmin. The new
Wikitech service is already up and running; on Wednesday I hope to flip the
last switch and move all public Horizon and Toolsadmin traffic to the new
servers as well.
If everything goes as planned, users will barely notice this change at all.
This is a lot of what our team does -- running as fast as we can just to
stay in place. Software doesn't last forever -- it takes a lot of effort
just to hold things together. Here are some of the problems that this
rebuild is solving:
- T186288 <https://phabricator.wikimedia.org/T186288>: *Operating System
obsolescence*. Years ago, the Wikimedia Foundation Operations team
resolved to move all of our infrastructure from Ubuntu to Debian Linux.
Ubuntu Trusty will stop receiving security upgrades in about a year, so we
have to stop using it by then. All three services (Wikitech, Horizon,
Toolsadmin) were running on Ubuntu servers; Wikitech was the last of the
Foundation's MediaWiki hosts to run on Ubuntu, so its upgrade should allow
for all kinds of special cases to be ignored in the future.
- T98813 <https://phabricator.wikimedia.org/T98813>: *Keeping up with
PHP and HHVM*. In addition to being the last wiki on Trusty, Wikitech
was also the last wiki on PHP 5. Every other wiki is using HHVM and, with
the death of the old Wikitech, we can finally stop supporting PHP 5
internally. Better yet, this plays a part in unblocking the entire
MediaWiki ecosystem (T172165 <https://phabricator.wikimedia.org/T172165>)
as newer versions of MediaWiki standardize on HHVM or PHP 7.
- T168559 <https://phabricator.wikimedia.org/T168559>: *Escaping failing
hardware*. The old Wikitech site was hosted on a machine named 'Silver'.
Hardware wears out, and Silver is pretty old. The last few times I've
rebooted it, it's required a bit of nudging to bring it back up. If it
powered down today, it would probably come back, but it might not. As of
today's switchover, that scenario won't result in weeks of Wikitech
downtime.
- T169099 <https://phabricator.wikimedia.org/T169099>: *Tracking
OpenStack upgrades*. OpenStack (the software project that includes
Horizon and most of our virtual machine infrastructure) releases a new
version every six months. Ubuntu packages up every version with all of its
dependencies, and provides a clear upgrade path between versions. Debian,
for the most part, does not. The new release of Horizon is no longer
deployed through an upstream package at all, but instead is a pure Python
deploy starting with the raw Horizon source and requirements list, rolled
into Wheels and deployed into an isolated virtual environment. It's unclear
exactly how we'll transition our other OpenStack components away from
Ubuntu, but this Horizon deploy provides a potential model for deploying
any OpenStack project, any version, on any OS. Having done this I'm much
less worried about our reliance on often-fickle upstream packagers.
- T187506 <https://phabricator.wikimedia.org/T187506>: *High
availability*. The old versions of these web services were hosted on
single servers. Any maintenance or hardware downtime meant that the
websites were gone for the duration. Now we have a pair of servers with a
shared cache, behind a load-balancer. If either of the servers dies (or,
more likely, we need to reboot one for kernel updates) the website will
remain up and responsive.
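For context, a wheels-into-virtualenv deploy like the Horizon one described
above can be sketched roughly as follows; the paths, package name, and
requirements file here are illustrative, not the actual puppetized deploy:

```shell
# Rough sketch of a pure-Python deploy (illustrative paths, not the
# real WMCS deployment code). Build wheels from the upstream source
# and its requirements list, then install them into an isolated
# virtual environment, never touching distro packages.
python3 -m venv /srv/deploy/horizon/venv
/srv/deploy/horizon/venv/bin/pip wheel \
    --wheel-dir /srv/deploy/horizon/wheels \
    -r requirements.txt horizon
/srv/deploy/horizon/venv/bin/pip install \
    --no-index --find-links /srv/deploy/horizon/wheels horizon
```

Because the install step uses --no-index, the target host consumes only the
pre-built wheel directory; the same recipe should in principle work for any
OpenStack component, any version, on any OS with a Python toolchain.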
Of course, now that we've just moved Wikitech to HHVM, the main Wikimedia
cluster is being upgraded from HHVM to PHP 7, and Wikitech will soon follow
suit.
The websites look the same, but the race never ends.
*POST DETAIL*
https://phabricator.wikimedia.org/phame/post/view/87/running_red-queen-style/
2018-03-07 21:00:02,824 INFO force is enabled
2018-03-07 21:00:02,880 INFO removing misc-project-backup
2018-03-07 21:00:02,980 INFO removing misc-project-backup
2018-03-07 21:00:03,827 INFO creating misc-project-backup at 2T
2018-03-07 21:00:04,690 INFO force is enabled
2018-03-07 21:00:04,735 INFO removing misc-snap
2018-03-07 21:00:04,778 INFO removing misc-snap
2018-03-07 21:00:05,273 INFO creating misc-snap at 1T
TARGET="/srv/backup/misc" SOURCE="/dev/mapper/backup-misc--project" FSTYPE="ext4" OPTIONS="rw,relatime,data=ordered"
2018-03-07 20:00:01,283 ERROR Local device is mounted. Operations may be unsafe
Hi,
I'm not sure if the outage from yesterday qualifies for an official
incident report. Let me know. Anyway, I will write here a report of what
happened for future reference.
Timeline:
* 2018-03-06 12:58Z arturo doing package upgrades in toolforge (jessie
machines) with clush [0]. All operations are logged in SAL [1].
* 2018-03-06 13:21Z some upgrades failed because of a debconf prompt. The
debconf prompt appeared because DEBIAN_FRONTEND=noninteractive was not
used. This resulted in stalled dpkg operations. There were also clashes
with puppet apt operations
* 2018-03-06 13:21Z arturo killed stalled dpkg procs in toolforge and
reconfigured the affected packages. Two important packages were affected:
libnss-ldap and sudo-ldap
* 2018-03-06 13:32Z users report some tools in toolforge misbehaving via
IRC and phabricator [2][3][4]; we start investigating (arturo and chico)
* 2018-03-06 13:38Z first investigations are directed towards DB issues,
so the DBA team is contacted. They confirm all is working fine on their
side.
* 2018-03-06 14:07Z chase arrives on the scene and starts investigating.
NFS client issues are detected: users can't read files in their home
directories in toolforge.
* 2018-03-06 14:23Z chasemp downtimes icinga alert for k8s workers
* 2018-03-06 15:13Z Andrew, Madhu, Bryan and Brook come to lend a hand.
* 2018-03-06 15:21Z the tracking task is created in phabricator [5]. By
this time, it is more than clear that the issue is related to the earlier
package upgrades.
* 2018-03-06 15:27Z some toolforge servers are rebooted. It is suggested
we start rebuilding part of the cluster.
* 2018-03-06 15:57Z Madhu reports that an nscd restart + nscd cache flush
+ machine reboot + puppet run can get servers back into a good state.
* 2018-03-06 16:21Z All systems are back to normal state.
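The recovery recipe from 15:57Z can be sketched as a per-node sequence;
command names are as on Debian jessie, and this is an after-the-fact sketch
rather than the exact commands that were run:

```shell
# Per-node recovery sketch (jessie-era commands, illustrative only):
# flush nscd's caches so stale LDAP lookups are dropped, restart the
# daemon, re-apply configuration with puppet, then reboot the node.
nscd --invalidate=passwd
nscd --invalidate=group
service nscd restart
puppet agent --test
reboot
```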
Notes:
* Package upgrade operations were carried out after several prior tests
on a list of canary servers [6].
* The need for DEBIAN_FRONTEND=noninteractive was already known, but a
human error occurred (arturo forgot to use it when doing the upgrades)
* Package pinning for the nss/ldap/pam packages is in place, but it is
not enough. We need a *complete* freeze of these packages.
* This shows that our upgrade workflow [7] is not ready for wide usage
and needs more development.
Conclusions:
* extend apt pinning for more nss/ldap/pam packages
* study implementing apt holds for those packages via puppet?
* embed DEBIAN_FRONTEND=noninteractive into apt-upgrade script
* better integration of apt-upgrade with other apt operations (especially
puppet). Perhaps we could automatically disable puppet from within the
apt-upgrade script while operations are in progress
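The *complete* freeze proposed above could, for example, take the shape of
an apt preferences fragment; the file path is illustrative, and the package
list here covers only the two packages named in the report (a negative pin
priority stops apt from ever selecting a candidate version):

```
# /etc/apt/preferences.d/ldap-freeze  (illustrative path)
Package: libnss-ldap
Pin: release *
Pin-Priority: -1

Package: sudo-ldap
Pin: release *
Pin-Priority: -1
```

Alternatively, `apt-mark hold libnss-ldap sudo-ldap` achieves a similar
freeze and might be easier to manage from puppet.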
Data:
* Affected tools in toolforge: at least 3 [2][3][4]
* Amount of downtime: 3h (13:21Z --> 16:21Z)
Related links:
[0] toolforge: package upgrades as part of the new workflow
https://phabricator.wikimedia.org/T188994
[1] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[2] Global User Contributions complains about replica conf file
https://phabricator.wikimedia.org/T189001
[3] Connection Error at OrphanTalk Tools
https://phabricator.wikimedia.org/T188998
[4] Tool https://tools.wmflabs.org/replag/ is reported via IRC
[5] Toolforge instances (maybe only Jessie?) are having issues with
NFS/LDAP https://phabricator.wikimedia.org/T189018
[6] https://etherpad.wikimedia.org/p/toolforge-upgrades
[7] create 'attended' upgrade workflow for cloud with Toolforge as
canonical case https://phabricator.wikimedia.org/T181647
2018-03-06 20:00:02,527 INFO force is enabled
2018-03-06 20:00:02,577 INFO removing tools-project-backup
2018-03-06 20:00:02,677 INFO removing tools-project-backup
2018-03-06 20:00:03,109 INFO creating tools-project-backup at 2T
2018-03-06 20:00:03,894 INFO force is enabled
2018-03-06 20:00:03,928 INFO removing tools-snap
2018-03-06 20:00:03,969 INFO removing tools-snap
2018-03-06 20:00:05,199 INFO creating tools-snap at 1T