Replying to myself to separate out review from new thoughts :)

Two things I wanted to comment on: 1) traffic control thoughts and 2) deployment naming.

1) I think the conclusion was that nftables does not have a TC equivalent and that current techniques cannot replace our TC usage. We could get more savvy with targeting using nftables (or iptables), and potentially the nftables project will look at a traffic-control-type mechanism.

I talked a little about where I stopped short in the existing implementation (of bastion resource QoS) at building a wrapper for TC, and in trying to retrace all the avenues I went down a few years ago I found https://packages.debian.org/stretch/firehol which appears to be just that. It has the potential to be unwieldy, being sprawling bash, but it is super interesting (https://github.com/firehol/firehol/blob/master/sbin/fireqos).

I /think/ https://wiki.nftables.org/wiki-nftables/index.php/Rate_limiting_matchings from the meeting doc (https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Offsite_Notes/kubecon-2017) would perform similar functions to the limit module in iptables, which drops violators instead of managing a queue to ensure both ends are sane consumers within the defined throughput limits. I think this doesn't exactly fit our model. Need to talk to Arturo to confirm I grok this entirely :) Super excited about the possibility of making our TC setup more dynamic and sane, and also moving to something more modern. (A rough sketch of the shaping-vs-dropping distinction is at the bottom of this mail.)

2) I think we should sidestep the meaningful-names pitfall and go for something distinct but not inherently descriptive. "Main" will be problematic whenever it stops being "main" in the implicit sense (see me naming something "secondary" that then became ...primary). Anything we try to brand with a relationship to a use case has this issue. I have used colors to this end before: "blue", "black", "orange" environments. That's just an example; I actually think we should go wholly generic and use numeric identifiers: depone, deptwo, depthree, potentially. Phonetically: "dep-one", "dep-two"; contextually: "one", "two". We can move the server from "one" to "three". IDK. I hate pure numeric tagging less than the other approaches I can think of. Not in love with "dep" as a prefix. Ideas needed.

On Mon, Dec 11, 2017 at 9:01 AM, Chase Pettet <cpettet@wikimedia.org> wrote:

Archived on office wiki: https://office.wikimedia.org/wiki/Wikimedia_Cloud_Services/Offsite_Notes/kubecon-2017

Decided:
- Stick with Trusty through the Neutron migration (for now, as we think we are making enough progress on this to ensure Trusty sunset by April 2019. Xenial seems to have Mitaka, so if we had to we could potentially match Mitaka there with Trusty for a migration of OpenStack releases across distro releases, but that's work we don't want to do, and we need to settle on a distro (see: figure out deployment methodology))
- https://phabricator.wikimedia.org/T166845 to be done via cumin for now (long term: Prometheus?)
- draw.io is a canonical tool
- Dumps work is a carry-over goal
- Neutron will be a carry-over goal, but hopefully not a literal one

Open Near Term:
- Neutron plan: talk about the naming of deployments
- Need to do hard capacity thinking on storage scaling and budgeting
- Icon templates for draw.io

Open Long(er) Term:
- Need to figure out OpenStack components deploy methodology (containers, source, distro packaging...)
- Is SLURM viable?
- kubeadm for Kubernetes deploy?
- Tools Bastion and resource issues
- Is there an upstream alternative that is viable for Quarry?
- How much do we fold into cloud-init?
- Do we use puppet standalone and virtualize the main cloud masters?
- Hiera for instances is a mess and needs to be rethought
- Trial of paging duty cycles (while still taking advantage of our time spread)
- How much of labtest is ready for parity testing?
- Document undoing standalone puppetmaster setup

Missed because of time:
- puppet horizon interface future (https://phabricator.wikimedia.org/T181551 and co)
- FUTURE IDEAS: NOT EXISTING
  - new TLDs for public and internal addresses: when and how to deploy
  - new ingress for HTTPS in VPS and Toolforge?
  - monitoring sanely
  - thinking about Ceph
  - metrics for end-users: who uses my tools and how?
I would like to talk about the missed items if we can find a few minutes at all hands (over dinner?)

--
Chase Pettet
chasemp on phabricator and IRC
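
P.S. To make the shaping-vs-dropping contrast from 1) concrete, here's a rough, untested Python sketch of the kind of thin TC wrapper I mean, next to the nftables "limit rate over ... drop" style rule. The device name, rates, and nft table/chain names are all placeholders, not anything we run today:

#!/usr/bin/env python3
# Rough sketch only: a thin wrapper around tc for HTB shaping (queue
# traffic to stay within a limit) versus an nftables limit-style rule
# that drops packets over the rate. The device, rates, and nft
# table/chain names below are hypothetical.
import subprocess

def run(cmd):
    # Echo and execute; raise if the command fails.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def shape_htb(dev, rate):
    # Shaping: packets over `rate` are queued/delayed, not dropped,
    # so both ends stay sane consumers within the throughput limit.
    run(["tc", "qdisc", "replace", "dev", dev, "root",
         "handle", "1:", "htb", "default", "10"])
    run(["tc", "class", "replace", "dev", dev, "parent", "1:",
         "classid", "1:10", "htb", "rate", rate, "ceil", rate])

def police_nft(rate_expr):
    # Policing: match packets *over* the rate and drop them outright,
    # i.e. the iptables limit-module behavior described above.
    run(["nft", "add", "rule", "inet", "filter", "forward",
         "limit", "rate", "over", *rate_expr.split(), "drop"])

if __name__ == "__main__":
    shape_htb("eth0", "100mbit")          # hypothetical device/rate
    # police_nft("12500 kbytes/second")   # the drop-on-violation model

FireQOS is basically this wrapper idea done thoroughly in bash; the sketch is just to illustrate the queue-vs-drop distinction, not a proposal.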