*Keynotes*
* At some point developers should never have to know or care what the
backend is; this is "real infrastructure as code".
https://github.com/brendandburns/metaparticle
* Subjective: workflow and developer experience are the new frontier. All
of this ecosystem is being built not as an end in itself, but rather to
describe a platform that enables innovation.
"Platforms are about speed"
"Kubectl is the new SSH."
"You know you are a Sr Engineer when people like you."
"We call them soft skills but they are hard to pull off."
*---Kelsey Hightower*
*Container runtime and image format standards*
https://www.opencontainers.org/announcement/2017/07/19/open-container-initi…
i.e. OCI is 1.0
This took two years and is still a work in progress. The OCI spec covers
what you should do, what you can do, and, interestingly, what you may not
do. One of the presenters compared the experience to POSIX, which took a
decade. Not much here that can't be gleaned from reading the spec and/or
news on the initiative.
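For concreteness: the runtime spec largely boils down to a config.json
sitting next to a root filesystem. A minimal sketch of one, generated from
Python (my own illustration, not from the talk; the fields shown are a
small subset and the values are made up):

    import json

    # Minimal, illustrative OCI runtime config.json in the spirit of the
    # 1.0 runtime spec. Fields are a small subset; values are made up.
    config = {
        "ociVersion": "1.0.0",
        "root": {"path": "rootfs", "readonly": True},
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": ["sh"],
            "cwd": "/",
        },
        "hostname": "demo",
    }

    with open("config.json", "w") as f:
        json.dump(config, f, indent=2)

A runtime like runc consumes exactly this kind of bundle, which is what
makes the spec's "may not do" language enforceable in practice.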
*Running mixed workloads on Kubernetes at IHME*
https://kccncna17.sched.com/event/CU7z/running-mixed-workloads-on-kubernete…
tldr; Univa owns the copyrights to SGE. They have a closed fork that
improves on the last open SGE release (the one we run), and they offer
closed-SGE on k8s as a managed service. Most of this was marketing and
explanations of IHME.
Takeaways: I thought at first that Univa had released their changes back
to the world, and it seems their original stated intent back in the day
was an open-core model. But alas, no, they are not playing nicely with the
greater FLOSS world. Yuvi even asked about this, and it was followed by
some noncommittal answers about code already being on GitHub. Ironically,
what is on GitHub is their fork of the original
https://github.com/gridengine/gridengine: 'Commits on Jun 1, 2012'. We
have seen 'gridengine' packages pop up in Debian Stretch, and that can
give the illusion of project health. The debs in Stretch are "Son of Grid
Engine", based on the 6.2 last open source release as seen at
https://arc.liv.ac.uk/trac/SGE/, and this seems like a barely surviving
variant:
*
https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=gridengine-master;dist=un…
*
http://metadata.ftp-master.debian.org/changelogs/main/g/gridengine/gridengi…
* 2016-03-02: Version 8.1.9 available. Note that this changes the
communication protocol due to the MUNGE support, and really should have
been labelled 8.2 in hindsight — ensure you close down execds before
upgrading.
Much love to these folks at the University of Liverpool, but we should
double down on the narrative that SGE and SoGE are dead projects, even if
we stay on them into Debian Stretch out of convenience. This model of
running a batch-style system on k8s is maybe interesting, but it's got to
be something other than S[o]GE.
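For comparison, here is roughly what a qsub-like, one-shot batch
submission looks like natively on k8s, sketched with the official Python
client (the job name and image are made up; assumes an existing cluster
context in ~/.kube/config):

    from kubernetes import client, config

    # Load credentials from ~/.kube/config (assumes a working cluster).
    config.load_kube_config()

    # A one-shot batch job: the closest native k8s analogue to an
    # SGE-style qsub submission. Names and image are illustrative.
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="demo-batch-job"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="worker",
                            image="busybox",
                            command=["sh", "-c", "echo hello from batch"],
                        )
                    ],
                )
            )
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

What is missing relative to SGE is everything around this: queues, fair
share, and accounting, which is presumably the gap Univa is selling into.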
*Reference*
* http://www.sdsc.edu/~hocks/FG/MSKCC.slurm.sge.html
* https://hpc.nih.gov/docs/pbs2slurm.html
* https://en.wikipedia.org/wiki/Comparison_of_cluster_software
* http://www.sdsc.edu/~hocks/FG/LL-PBS-SGE.html
* https://bugs.schedmd.com/show_bug.cgi?id=2208
*More usable k8s*
https://kccncna17.sched.com/event/CU8L/the-road-to-more-usable-kubernetes-j…
Joe Beda is a super interesting thinker in this space IMO and I went to
this mainly because of him.
tldr; Heptio has ksonnet (https://github.com/ksonnet), a way of
thinking about composing infrastructure as code. It's kinda-sorta a Helm
alternative, though I think both sides would bristle at that description :)
ksonnet seems deeply interesting, but a lot of the configuration avoidance
looks to me like convention-as-configuration, which is lock-in as much as,
or more so than, a Helm-based approach. There are totally valid criticisms
of Helm "at scale", I imagine; I have used Helm only a bit for personal
testing, so I'm not entirely sure. This space is young and the urge for
these folks to build a DSL is strong.
Yaml was described as an "assembly level primitive" (in regards to
Helm). I would like to try out ksonnet a bit more. Not convinced.
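The composition pitch, as I understood it, is building manifests as
parameterized data instead of templating YAML strings. A toy Python
analogue of the idea (this is the concept only, nothing to do with
ksonnet's actual API):

    # Toy illustration of composition-over-templating: manifests are
    # plain data built by functions with overridable parameters.
    def deployment(name, image, replicas=1, labels=None):
        labels = {"app": name, **(labels or {})}
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": name, "labels": labels},
            "spec": {
                "replicas": replicas,
                "selector": {"matchLabels": labels},
                "template": {
                    "metadata": {"labels": labels},
                    "spec": {
                        "containers": [{"name": name, "image": image}]
                    },
                },
            },
        }

    # "Environments" become overrides layered on a shared base.
    base = deployment("webapp", "example/webapp:1.0")
    prod = deployment("webapp", "example/webapp:1.0", replicas=5)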
*Multi-Tenancy Support and Security Modeling with RBAC and Namespaces*
https://kccncna17.sched.com/event/CU7j/multi-tenancy-support-security-model…
tldr; a walk through RBAC personas. The models we want to replace our
homebrew with mostly exist, in theory. They showed off the VMware UI on
top of the k8s-native magic. I was hoping for more of a technical
breakdown, but there were interesting descriptions of the ClusterRole vs.
namespaced Role split and of Namespace-based isolation. Fits nicely into
our model of the world.
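For reference, the namespaced side of that split sketched with the
official Python client (model names per the client docs; the namespace,
group, and role names are made up):

    from kubernetes import client, config

    config.load_kube_config()
    rbac = client.RbacAuthorizationV1Api()

    # A namespaced Role: read-only access to pods in one tenant namespace.
    role = client.V1Role(
        metadata=client.V1ObjectMeta(name="pod-reader", namespace="tenant-a"),
        rules=[
            client.V1PolicyRule(
                api_groups=[""],
                resources=["pods"],
                verbs=["get", "list", "watch"],
            )
        ],
    )
    rbac.create_namespaced_role(namespace="tenant-a", body=role)

    # Bind the Role to a tenant group. The same pattern with a ClusterRole
    # and ClusterRoleBinding grants access across all namespaces instead.
    binding = client.V1RoleBinding(
        metadata=client.V1ObjectMeta(
            name="pod-reader-binding", namespace="tenant-a"
        ),
        subjects=[
            client.V1Subject(
                kind="Group",
                name="tenant-a-users",
                api_group="rbac.authorization.k8s.io",
            )
        ],
        role_ref=client.V1RoleRef(
            kind="Role",
            name="pod-reader",
            api_group="rbac.authorization.k8s.io",
        ),
    )
    rbac.create_namespaced_role_binding(namespace="tenant-a", body=binding)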
*CNI, CRI, and OCI "Oh My"*
https://kccncna17.sched.com/event/CU6L/cni-cri-and-oci-oh-my-i-elsie-philli…
goo.gl/fK8kFS
tldr; standards and where they came from. The slides are decent. Two
community-liaison-type folks from CoreOS talked about AppC being
abandoned, plus some foundational thinking: "What is a container?" "Why do
standards exist?" "How is Docker involved?"
I have found CNI confusing as far as scope goes: is it a standard, a spec,
or an implementation? So for me this was mainly an unwinding of acronym
trivia.
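The part that made CNI click for me: the unit of the spec is a network
config JSON that the runtime hands to a plugin binary named by "type". A
minimal sketch (the plugin choice and values here are illustrative):

    import json

    # Minimal CNI network config in the spirit of the spec: the runtime
    # invokes the plugin binary named by "type" and passes it this JSON.
    # Values are made up.
    net_conf = {
        "cniVersion": "0.3.1",
        "name": "demo-net",
        "type": "bridge",          # plugin binary to invoke
        "bridge": "cni0",
        "isGateway": True,
        "ipMasq": True,
        "ipam": {
            "type": "host-local",  # IP allocation delegated to a 2nd plugin
            "subnet": "10.22.0.0/16",
        },
    }

    print(json.dumps(net_conf, indent=2))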
*Local Ephemeral Storage Resource Management*
https://kccncna17.sched.com/event/CU7X/local-ephemeral-storage-resource-man…
https://github.com/jingxu97
I really liked her style of presentation and clear breakdown of ideas. I
think this was a more academic presentation from someone who clearly is in
the trenches but I went to get insight into one essential problem: Disk IO
QoS and limiting. That was on the last slide labeled "Future" and she said
they were determining if it was a "problem worth solving". If we had
unlimited money I would hire this person.
Mainly talking about quotas and quota-setting levels for storage: pod and
namespace. Most of this is k8s 1.8 or greater, AFAICT.
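The pod-level knob is the new ephemeral-storage resource (k8s 1.8+),
sketched here with the official Python client (image and values made up):

    from kubernetes import client, config

    config.load_kube_config()

    # Pod-level ephemeral-storage requests/limits (k8s 1.8+): exceeding
    # the limit makes the pod a candidate for eviction. Values made up.
    container = client.V1Container(
        name="app",
        image="busybox",
        command=["sh", "-c", "sleep 3600"],
        resources=client.V1ResourceRequirements(
            requests={"ephemeral-storage": "1Gi"},
            limits={"ephemeral-storage": "2Gi"},
        ),
    )

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="ephemeral-demo"),
        spec=client.V1PodSpec(containers=[container]),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

Namespace-level enforcement is the same idea via a ResourceQuota object.
Note this is all about capacity; none of it touches the disk IO QoS
question above.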
rant: stateful reasoning and resourcing of tenants, with sane isolation
for storage, is the huge elephant in the room in this cloud native world.
I noted in the TOC public meeting that the storage SIG had become
particularly vocal after a period of relative politeness over async
channels. I continue to think that resource isolation for storage is the
single least solved problem in cloud. Basically, you should be able to tie
logical resources for compute and memory to physical resources that are
dedicated, isolated islands.
*Prometheus 2.0 "salon"*
I think 2.0 is the first release of Prometheus I have seen that looks prod
ready. The list of half-punted issues was always too long for me: backups,
alerts, rollups, performance, storage. 2.0 is not backwards compatible at
all w/ 1.x. The ex-intern-engineer giving the "what's new in 2.0" portion
of the talk said to just move to 2.0 and leave old metrics behind. I think
for our stuff we should actually do this. That slide deck is not published
but most of it is here https://coreos.com/blog/prometheus-2.0-released.
The performance improvements are awesome. The storage usage is awesome.
Rather than feeling like Prometheus is the best of bad options, I think it
may actually be...cool as of 2.0. There was a nice talk about a lot of the
nuts-and-bolts reasoning behind Prometheus internals:
https://schd.ws/hosted_files/kccncna17/c4/KubeCon%20P8s%20Salon%20-%20Kuber….
There were three presentations over about an hour and a half, with a lot
of wisdom on practical applications for tagging and collection: how not to
explode cardinality with well-intentioned-but-chaotic tagging, and that
kind of thing. 2.0 has no-downtime backups. Rule groups are now defined
with yaml. Worth looking through that presentation and the 2.0
announcement. We have a lot of things to figure out here, but it seems the
propulsion (of k8s) and investment in Prometheus may have led to something
usable...potentially :) I see only <2.0 in Debian atm.
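On the tagging wisdom, the recurring advice was to keep label values
bounded. A small illustration with the Python prometheus_client (metric
and label names made up):

    from prometheus_client import Counter, start_http_server

    # Good: label values drawn from small, bounded sets, so the number of
    # time series stays bounded too.
    REQUESTS = Counter(
        "http_requests_total",
        "HTTP requests served",
        ["method", "status"],
    )

    # Bad (don't do this): an unbounded value such as user ID or raw URL
    # as a label mints a new time series per distinct value and explodes
    # cardinality.
    # REQUESTS_BY_USER = Counter("requests_by_user", "...", ["user_id"])

    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    REQUESTS.labels(method="GET", status="200").inc()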
https://prometheus.io/blog/2017/11/08/announcing-prometheus-2-0/
https://kccncna17.sched.com/event/Cs4d
Migration guide:
https://prometheus.io/docs/prometheus/latest/migration/
*Openstack and k8s SIG*
Background: k8s has the ability to integrate more tightly with an external
component, e.g. service IPs actually managed by Neutron at the
openstack layer (providing visibility and integration), or Cinder block
devices allocated in OpenStack to be used by k8s, etc.
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/prov…
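Concretely, the in-tree provider is enabled by running kubelet and the
controller-manager with --cloud-provider=openstack plus a cloud config
that names the Keystone endpoint. A sketch of generating that config from
Python (section and key names as I understand the in-tree provider; every
value is made up):

    import configparser

    # Sketch of the cloud.conf consumed by the in-tree OpenStack cloud
    # provider. Key names per my reading of the provider docs; all
    # values are made up.
    conf = configparser.ConfigParser()
    conf["Global"] = {
        "auth-url": "https://keystone.example.org:5000/v3",
        "username": "k8s-svc",
        "password": "changeme",
        "tenant-name": "k8s",
        "region": "region-one",
    }
    # Cinder volumes backing k8s PersistentVolumes.
    conf["BlockStorage"] = {"bs-version": "v2"}
    # Neutron subnet used for service load balancers.
    conf["LoadBalancer"] = {"subnet-id": "replace-with-subnet-uuid"}

    with open("cloud.conf", "w") as f:
        conf.write(f)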
*My impressions and takeaways, though it was hard to keep track and I may
be wrong:*
Despite having been around for a while, this is in its early stages and
IMO the future is unknown. Huawei apparently has been doing some work
here, since they run a sizeable openstack cloud and are heavily invested
in k8s. Who should own CI and integration testing? Where do the resources
come from? Integration testing covers Mitaka+, and there is a need to
certify that HEAD in k8s land does not break existing use cases, and
possibly to certify certain OpenStack releases for certain k8s releases.
It seems k8s upstream wants to decouple all provider code into external
libraries, to take it out of core and make the projects more independent.
Who owns this?
"As we all know Neutron is not very self describing" -- Random Dev In This
SIG
Lots of talk and hijacking on install best practices. There seems to be
some consensus in k8s internal circles that kubeadm will be the future
across all mediums for k8s deployment. Kubespray was mentioned several
times. So that's k8s on openstack. What about openstack on k8s? :D Some
are doing it, but no one has published significant blogs or use cases.
Most openstack devs seem to be using
https://github.com/openstack/openstack-ansible, which seems like LXC
without a k8s-like scheduler or orchestration layer. Kolla-ansible seems
to have momentum and be blessed, but no one there had much to say about it
otherwise.
I'm really interested in this area of inquiry, but at the very present
moment I think operating our entities as ships-in-the-night has a lot of
benefit, as the tangle of integration runs deep and muddy.
Most people attending meetings with ops or SOS will be familiar with this,
but this quarter we will have some potentially breaking changes on the
mediawiki databases:
* T177208: We have scheduled moving wikidata from "s5" to "s8" on 9
January, to provide dedicated resources to it. dewiki will stay on "s5".
Config is already prepared:
https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php but it will
not be effective until that day, when we will have a read-only period to
perform the split. Most of mediawiki should be ok with it, but there could
be scripts hardcoding that wikidata is on s5 (especially toolforge
scripts; we should coordinate an announcement with cloud there).
* T178359: We now have multi-instance hosts in production (meaning
multiple mysql instances per physical host). That means that mysql will no
longer be available only on the default port 3306; you may have to
indicate a specific socket (mysql --skip-ssl
--socket=/run/mysqld/mysqld.s1.sock) or port (-P3311), as in the sketch
below. You can see the list of hosts on tendril or grafana.
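For scripts that connect directly, a sketch of both forms with PyMySQL
(socket path and port from the examples above; host and credentials are
made up):

    import pymysql

    # Local connection to one instance via its unix socket.
    via_socket = pymysql.connect(
        unix_socket="/run/mysqld/mysqld.s1.sock",
        user="repl",
        password="changeme",
    )

    # Remote connection via the per-instance TCP port instead of 3306.
    via_port = pymysql.connect(
        host="db-host.example.org",
        port=3311,
        user="repl",
        password="changeme",
    )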
Having this in an email may come in handy in case of emergency.
Cheers,
--
Jaime Crespo
<http://wikimedia.org>
I tried to whip something up. I don't think it is trivially easy or
horribly hard, but I'd like y'all to take a quick look before I give
it to Liz so she can start handing it out to candidates. There's one
sneaky question in there that a candidate may figure out an answer to,
but it's mostly to see if they can ask good follow-up questions. The
others should all be not too hard to find in the support documents I
linked (I hope).
https://etherpad.wikimedia.org/p/WMCS-techsupport-task
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA
irc: bd808 v:415.839.6885 x6855