- Cloud-admin - lists.wikimedia.org

Cron <root@labstore2004> /usr/local/sbin/block_sync 10.64.37.20 misc misc-project misc-snap backup misc-project misc-project-backup 2T
by root＠labstore2004.codfw.wmnet 14 Dec '17

14 Dec '17

2017-12-13 20:00:02,312 INFO force is enabled 2017-12-13 20:00:02,337 INFO removing misc-project-backup 2017-12-13 20:00:02,411 INFO removing misc-project-backup 2017-12-13 20:00:02,766 INFO creating misc-project-backup at 2T 2017-12-13 20:00:03,613 INFO force is enabled 2017-12-13 20:00:03,630 INFO removing misc-snap 2017-12-13 20:00:03,664 INFO removing misc-snap 2017-12-13 20:00:04,102 INFO creating misc-snap at 1T

1 0

Kolla and k8s
by Chase Pettet 14 Dec '17

14 Dec '17

https://review.openstack.org/#/c/255450 I joined the openstack-k8s-sig after the interesting converstion in Austin. Seems fairly low key atm, but I did turn up Ebay's use case for deploying openstack on k8s. https://docs.google.com/document/d/15UwgLbEyZyXXxVtsThcSuPiJru4CuqU9p3ttZSf… eBay Cluster Bootstrap Using Kubernetes https://docs.google.com/document/d/1zvgz00IMJkh7kEMhGNuVPiBHzTF8FPo6RWYyFdd… Pretty interesting, actually the director of cloud things was in that panel with me so I could reach out if we have questions. This seems to be very close to our interests. -- Chase Pettet chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and IRC

1 0

Cron <root@labstore2003> /usr/local/sbin/block_sync 10.64.37.20 tools tools-project tools-snap backup tools-project tools-project-backup 2T
by root＠labstore2003.codfw.wmnet 13 Dec '17

13 Dec '17

2017-12-12 20:00:03,035 INFO force is enabled 2017-12-12 20:00:03,059 INFO removing tools-project-backup 2017-12-12 20:00:03,104 INFO removing tools-project-backup 2017-12-12 20:00:04,374 INFO creating tools-project-backup at 2T 2017-12-12 20:00:05,161 INFO force is enabled 2017-12-12 20:00:05,186 INFO removing tools-snap 2017-12-12 20:00:05,220 INFO removing tools-snap 2017-12-12 20:00:07,476 INFO creating tools-snap at 1T

1 0

Kubecon 2017 Austin Feedback
by Chase Pettet 12 Dec '17

12 Dec '17

*Keynotes* * developers at some point should never have to know or care what the backend is. https://github.com/brendandburns/metaparticle :* This is "real infrastructure as code" * subjective: workflow and developer experience is the new frontier. All of this ecosystem is being built not as an end in itself, but rather to describe a platform that enables innovation. "Platforms are about speed" "Kubectl is the new SSH." "You know you are a Sr Engineer when people like you." "We call them soft skills but they are hard to pull off." *---Kelsey Hightower* *Container runtime and image format standards* https://www.opencontainers.org/announcement/2017/07/19/open-container-initi… i.e. OCI is 1.0 This took 2 years and is a further work in progress. The OCI spec has what you should do, what you can do, and also interestingly what you may not do. One of the presenters spoke about this experience vs POSIX which took a decade. Not much here that can't be gleamed from reading the spec and/or news on the inititiative. *Running mixed worklods on Kubernetes at IMHE* https://kccncna17.sched.com/event/CU7z/running-mixed-workloads-on-kubernete… tldr; Univa owns the copyrights to SGE. They have a closed fork having improved on the last SGE release (that we run). They facilitate close-SGE on k8s as a managed service. Most of this was marketing and explanations of IMHE. takeaways; I thought at first that Univa had released their changes back into the world, and it seems their original stated intent back in the day was an opencore model. But alas, no, they are not playing nicely with the greater FLOSS world. Yuvi even asked about this and it was followed by some noncommittal answers about code already on github. Ironically, what is on github is their fork of the original https://github.com/gridengine/gridengine: 'Commits on Jun 1, 2012'. We have seen 'gridengine' packages pop up on Debian Strech and that can give the illusion of project health. The debs in Stretch are "Son of Grid Engine" based on the 6.2 last opensource release as seen at https://arc.liv.ac.uk/trac/SGE/ and this seems like a barely surviving varient: * https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=gridengine-master;dist=un… * http://metadata.ftp-master.debian.org/changelogs/main/g/gridengine/gridengi… * 2016-03-02: Version 8.1.9 available. Note that this changes the communication protocol due to the MUNGE support, and really should have been labelled 8.2 in hindsight — ensure you close down execds before upgrading. Much love to these folks at University of Liverpool but we should double down on the narrative that SGE or SoGE are dead projects. Even if we stay on them into Debian Stretch out of convenience. This model of running a batch style system on k8s is maybe interesting but it's got to be something other than S[0]GE. *Reference* http://www.sdsc.edu/~hocks/FG/MSKCC.slurm.sge.html https://hpc.nih.gov/docs/pbs2slurm.html https://en.wikipedia.org/wiki/Comparison_of_cluster_software http://www.sdsc.edu/~hocks/FG/LL-PBS-SGE.html https://bugs.schedmd.com/show_bug.cgi?id=2208 *More usable k8s* https://kccncna17.sched.com/event/CU8L/the-road-to-more-usable-kubernetes-j… Joe Beda is a super interesting thinker in this space IMO and I went to this mainly because of him. tldr; Heptio has ksonnet (https://github.com/ksonnet) which is a way of thinking about composing infrastructure as code. Kinda-sorta Helm alternative but I think both sides would bristle at that description :) ksonnet seems deeply interesting but a lot of the configuration avoidance I see as convention as configuration. That is lock-in as much as, or more so, than a helm based approach. Totally valid criticisms of Helm "at scale" I imagine, I have used Helm only a bit for personal testing so I'm not entirely sure. This space is young and the urge for these folks to build a DSL is strong. Described Yaml as being an "assembly level primitive" (in regards to Helm). Would like to try out ksonnet a bit more. Not convinced. *Multi-Tenancy Support and Security Modeling with RBAC and Namespaces* https://kccncna17.sched.com/event/CU7j/multi-tenancy-support-security-model… tldr; walk through RBAC personas. The models we want to replace our homebrew in-theory mostly exist. Show off the VMWare UI on top of the k8s native magic. I was hoping for more technical breakdown. Interesting descriptions of the ClusterRole vs NamespaceRole breakdowns and Namespace reasoned isolation. Fits nicely into our model of the world. *CNI, CRI, and OCI "Oh My"* https://kccncna17.sched.com/event/CU6L/cni-cri-and-oci-oh-my-i-elsie-philli… goo.gl/fK8kFS tldr; standars and their origination. The slides are decent. Two community liason type folks from CoreOS talking about AppC being abandoned. Some foundational thinking "What is a container?" "Why do standards exist?" "How is Docker involved?" I have found CNI confusing as far as scope, standard and spec or implementation? So it was mainly an unwinding of trivia acronyms for me. *Local Ephemeral Resource Management* https://kccncna17.sched.com/event/CU7X/local-ephemeral-storage-resource-man… https://github.com/jingxu97 I really liked her style of presentation and clear breakdown of ideas. I think this was a more academic presentation from someone who clearly is in the trenches but I went to get insight into one essential problem: Disk IO QoS and limiting. That was on the last slide labeled "Future" and she said they were determining if it was a "problem worth solving". If we had unlimited money I would hire this person. Mainly talking about quotas and quota setting levels for storage: pod, namespace. Most of this is k8s 1.8 or greater AFAICT. rant: stateful reasoning and resourcing of tenants with sane isolation for storage is this huge elephant in the room in this cloud native world. I noted in the TOC public meeting that the storage sig had become particularly vocal after a period of relative politeness over async channels. I continue to think that resource isolation of storage is the single least solved problem in cloud. Basically, you should tie logical resources for compute and mem to physical resources that are islands to dedicate resourcing. *Prometheus 2.0 "salon"* I think 2.0 is the first release of Prometheus I have seen that looks prod ready. The list of half-punted issues was always too long for me: backups, alerts, rollups, performance, storage. 2.0 is not backwards compatible at all w/ 1.x. The ex-intern-engineer giving the "what's new in 2.0" portion of the talk said to just move to 2.0 and leave old metrics behind. I think for our stuff we should actually do this. That slide deck is not published but most of it is here https://coreos.com/blog/prometheus-2.0-released. The performance improvements are awesome. The storage usage is awesome. Rather than feeling like Prometheus is the best of bad options I think it may actually be...cool as of 2.0. Nice talk about a lot of the nuts and bolts reasoning of Prometheus internals https://schd.ws/hosted_files/kccncna17/c4/KubeCon%20P8s%20Salon%20-%20Kuber…. There were 3 presentations over about an hours and a half. A lot of wisdom on practical applications for tagging and collection. How not to explode cardinality with well-intentioned-but-chaotic-tagging, and that kind of thing. 2.0 has //no downtime backups//. Rule groups are now defined w/ yaml. Worth looking through that presentation and the 2.0 announcement. We have a lot of things to figure out here but it seems the propulsion (of k8s) and investment in Prometheus may have led to something usable...potentially :) I see only <2.0 in Debian atm. https://prometheus.io/blog/2017/11/08/announcing-prometheus-2-0/ https://kccncna17.sched.com/event/Cs4d 'migration' https://prometheus.io/docs/prometheus/latest/migration/ *Openstack and k8s SIG* Background: k8s has the ability to integrate more tightly with an external component. i.e. maybe service IPs are actually managed in Neutron at the openstack layer providing visibility and integration, or Cinder blocks are allocated in OpenStack to be used by k8s. etc. https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/prov… *My impressions and take aways even though it was hard to keep track and I may be wrong:* This despite having been around for awhile is in early stages and IMO the future is unknown. Huawei apparently has been doing some work here since they run a sizeable openstack cloud and are heavily invested in k8s. Who should own CI and integration testing? Where do the resources come from? Integration testing from Mitaka+ and need to certify that HEAD in k8s land does not break existing use cases, and possibly certify certain OS releases for certain k8s releases. It seems k8s upstream wants to decouple all provider code into external libraries to take it out of core and make the projects more independent. Who owns this? "As we all know Neutron is not very self describing" -- Random Dev In This SIG Lots of talk and hijacking on install best practices. It seems there is some consensus in k8s internal circles that kubeadm will be the future across all mediums for k8s deployment. Kubespray was mentioned several times. So that's k8s on openstack. What about openstack on k8s? :D Some are doing it, no one has published significant blogs or use cases. Most openstack devs seem to be using https://github.com/openstack/openstack-ansible which seems like LXC w/o a k8s like scheduler or orchestration layer. Kolla ansible seems to have momentum and be blessed but no one there had much to say about it otherwise. I'm really interested in this area of inquiry but at the very present moment I think operating our entities as ships-in-the-night has a lot of benefit as the tangle of integration runs deep and muddy.

1 0

On-call notes 2017-11-28 - 2017-12-04
by Bryan Davis 12 Dec '17

12 Dec '17

== 2017-11-28 - 2017-12-04 == Tues: * Superyetkin irc about error.log output * Superyetkin irc about user dbs * Created massmessage project Wed: * Paladox spotted a Puppet problem caused by refactoring; Andrew fixed it * Talked to matanya about his collaboration project request * Created collaboration project * Talked to matanya about ssh issues connecting to tools-login (looks like routing) Thur: * Fixed a bug in stashbot that anomie noticed * Superyetkin and https://phabricator.wikimedia.org/P6409 * Krinkle and https://phabricator.wikimedia.org/T181742 * Quota reduction -- https://phabricator.wikimedia.org/T177299 * Pinged for a 2FA removal -- https://phabricator.wikimedia.org/T181475 * Verified and closed https://phabricator.wikimedia.org/T176043 * Made a maintenance script to attach LDAP accounts to wikitech for https://phabricator.wikimedia.org/T180813 * Attached Flominator and set email to match SUL account * Triaged some random bugs in our backlog * Made some traffic report dumps: ** https://phabricator.wikimedia.org/P6413 ** https://phabricator.wikimedia.org/P6414 Fri: * Phab triage * Worked on https://phabricator.wikimedia.org/T171417 Sun: * Crontab encoding -- https://phabricator.wikimedia.org/T181948 Bryan -- Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> [[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA irc: bd808 v:415.839.6885 x6855

1 0

Cron <root@labstore2004> /usr/local/sbin/block_sync 10.64.37.20 misc misc-project misc-snap backup misc-project misc-project-backup 2T
by root＠labstore2004.codfw.wmnet 07 Dec '17

07 Dec '17

2017-12-06 20:00:02,969 INFO force is enabled 2017-12-06 20:00:03,016 INFO removing misc-project-backup 2017-12-06 20:00:03,116 INFO removing misc-project-backup 2017-12-06 20:00:03,763 INFO creating misc-project-backup at 2T 2017-12-06 20:00:04,629 INFO force is enabled 2017-12-06 20:00:04,674 INFO removing misc-snap 2017-12-06 20:00:04,736 INFO removing misc-snap 2017-12-06 20:00:05,022 INFO creating misc-snap at 1T

1 0

Cron <root@labstore2003> /usr/local/sbin/block_sync 10.64.37.20 tools tools-project tools-snap backup tools-project tools-project-backup 2T
by root＠labstore2003.codfw.wmnet 07 Dec '17

07 Dec '17

2017-12-05 20:00:03,140 INFO force is enabled 2017-12-05 20:00:03,170 INFO removing tools-project-backup 2017-12-05 20:00:03,212 INFO removing tools-project-backup 2017-12-05 20:00:03,719 INFO creating tools-project-backup at 2T 2017-12-05 20:00:04,439 ERROR b' /dev/tools/tools-snap: read failed after 0 of 4096 at 8796092956672: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 8796093014016: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 4096: Input/output error\n' 2017-12-05 20:00:04,440 INFO force is enabled 2017-12-05 20:00:04,496 ERROR b' /dev/tools/tools-snap: read failed after 0 of 4096 at 8796092956672: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 8796093014016: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 4096: Input/output error\n' 2017-12-05 20:00:04,497 INFO removing tools-snap 2017-12-05 20:00:04,513 ERROR b' /dev/tools/tools-snap: read failed after 0 of 4096 at 8796092956672: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 8796093014016: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 4096: Input/output error\n' 2017-12-05 20:00:04,530 ERROR b' /dev/tools/tools-snap: read failed after 0 of 4096 at 8796092956672: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 8796093014016: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 4096: Input/output error\n' 2017-12-05 20:00:04,531 INFO removing tools-snap 2017-12-05 20:00:06,107 ERROR b' /dev/tools/tools-snap: read failed after 0 of 4096 at 8796092956672: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 8796093014016: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 0: Input/output error\n /dev/tools/tools-snap: read failed after 0 of 4096 at 4096: Input/output error\n' 2017-12-05 20:00:06,152 INFO creating tools-snap at 1T

1 0

reinvent
by Chase Pettet 02 Dec '17

02 Dec '17

I have met one of the hosts of this and they all seem like nice people, nice breakdown from reinvent @ http://www.softwaredefinedtalk.com/113 AWS really is the 800 pound gorilla in the room. Also interesting http://www.computerweekly.com/feature/Redefining-OpenStack-Addressing-the-i… -- Chase Pettet chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and IRC

1 0

Cron <root@labstore2004> /usr/local/sbin/block_sync 10.64.37.20 misc misc-project misc-snap backup misc-project misc-project-backup 2T
by root＠labstore2004.codfw.wmnet 30 Nov '17

30 Nov '17

2017-11-29 20:00:02,315 INFO force is enabled 2017-11-29 20:00:02,357 INFO removing misc-project-backup 2017-11-29 20:00:02,456 INFO removing misc-project-backup 2017-11-29 20:00:03,196 INFO creating misc-project-backup at 2T 2017-11-29 20:00:03,960 INFO force is enabled 2017-11-29 20:00:04,009 INFO removing misc-snap 2017-11-29 20:00:04,060 INFO removing misc-snap 2017-11-29 20:00:04,532 INFO creating misc-snap at 1T

1 0

Changes on database infrastructure that could break hardcoded utilities (s8, multiinstance)
by Jaime Crespo 30 Nov '17

30 Nov '17

Most people attending meeting with ops or SOS will be familiar with it, but on this quarter we will have some potentially breaking changes on the mediawiki databases: * T177208: We have scheduled for the 9th January moving wikidata from "s5" to "s8" to provide dedicated resources to it. dewiki will stay at "s5". Config is already prepared: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php but it will not be effective until that day, when we will have a read-only period to perform the split. Most of mediawiki should be ok with it, but there could be scripts hardcoding wikidata is on s5 (specially on toolforge scripts- we should coordinate an announcement with cloud there) * T178359: We now have production multi-instance hosts (meaning multiple instances per physical host). That means that mysql will not be available only on port 3306 (default), you may have to indicate a specific socket (mysql --skip-ssl --socket=/run/mysqld/mysqld.s1.sock) or port (-P3311). You can see the list of hosts on tendril or grafana. Having this on an email can become handy in case of emergency. Cheers, -- Jaime Crespo <http://wikimedia.org>

1 0

2024

2023

2022

2021

2020

2019

2018

2017

Cloud-admin