Hi there,
We are currently working on replacing older hardware servers with newer
ones, in particular those dedicated to cloud networking [0].
We have discovered a few shortcomings, mostly related to network
interface naming on the newer servers, the latest OpenStack version
behaving differently than it used to, and some base operating
system (Debian) bugs [1]. Some of these are hardware-dependent and
difficult to reproduce or anticipate in our staging environment.
The result is that the migration is more challenging and noisier than
we would like. We have already had a few (brief) network outages while
introducing the new servers into service.
We'll try to keep things as stable as possible over the next few days
until the migration is completed, but we can't rule out some more
(brief) network outages until we are safely on the other side of the
transition.
I'll send another note when this network maintenance is over.
regards.
[0] https://phabricator.wikimedia.org/T316284
[1] https://bugs.debian.org/989162
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
In my daily reboots, I'm coming across a few weird VMs that should be on
ceph but instead are using local storage on hypervisors. I've tracked
this down to particular flavors which don't specify the Ceph backend.
For example, here's a 'good' flavor:
root@cloudcontrol1003:~# openstack flavor show g3.cores4.ram8.disk20
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| access_project_ids | None |
| description | None |
| disk | 20 |
| id | c14b5856-5e6a-4f7a-8125-ace4f616c299 |
| name | g3.cores4.ram8.disk20 |
| os-flavor-access:is_public | True |
| properties | aggregate_instance_extra_specs:ceph='true', quota:disk_read_iops_sec='5000', quota:disk_total_bytes_sec='200000000', quota:disk_write_iops_sec='500' |
| ram | 8192 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 4 |
+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
...and here's a bad one...
root@cloudcontrol1003:~# openstack flavor show g3.cores24.ram122.disk20
+----------------------------+--------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| access_project_ids | ['wikiwho'] |
| description | None |
| disk | 20 |
| id | 6207fade-5517-4ac4-a147-6ae0c8fa1384 |
| name | g3.cores24.ram122.disk20 |
| os-flavor-access:is_public | False |
| properties | |
| ram | 124928 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 24 |
+----------------------------+--------------------------------------+
Note the lack of properties set on the latter.
We should definitely investigate giving flavors sensible defaults when
no properties are set -- at this point we're setting those same
properties on almost every flavor, which seems silly. In the meantime,
though, it's important that those properties be specified on new
flavors; otherwise VMs created with those flavors can be consigned to
purgatory and exhibit weird behavior.
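For reference, creating or fixing a flavor with the expected properties
looks roughly like this (the g3.cores8.ram16.disk20 name is just a
made-up example; the property values are copied from the good flavor
above, so double-check before running):

# hypothetical new flavor; adjust name and sizes as needed
openstack flavor create g3.cores8.ram16.disk20 \
  --vcpus 8 --ram 16384 --disk 20 \
  --property aggregate_instance_extra_specs:ceph='true' \
  --property quota:disk_read_iops_sec='5000' \
  --property quota:disk_total_bytes_sec='200000000' \
  --property quota:disk_write_iops_sec='500'

# or, to fix an existing flavor that is missing the properties:
openstack flavor set g3.cores24.ram122.disk20 \
  --property aggregate_instance_extra_specs:ceph='true' \
  --property quota:disk_read_iops_sec='5000' \
  --property quota:disk_total_bytes_sec='200000000' \
  --property quota:disk_write_iops_sec='500'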
The docs about flavor creation are here:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_flavors#Gener…
Most likely that should be duplicated or linked from other places, though.
Thanks!
-A
Hi there,
Although I don't think there is a strong mandate (or policy) to migrate
everything to GitLab just yet, we already have a few repos here and
there. And there is agreement that GitLab is generally beneficial for
us and that we should slowly be paying more attention to it as we go.
Naming can be hard, and organizing stuff based on naming can also be
hard. But we already have some kind of "tree" on gitlab. Let me try to
brain dump the mental image I have, and feel free to share / discuss as
required.
Main entry point:
https://gitlab.wikimedia.org/repos/cloud
The repos/cloud group is our main "directory". Everything that WMCS
(plus collaborators) maintains as part of our base services and
infrastructure should live there. Explicitly excluded is
user-controlled content, like Toolforge tool source code, etc.
Remember, being under /repos/ gives us some additional features (like
trusted CI runners), and moreover /repos/cloud/ already has some
membership defined that will be inherited by all child repos.
This group doesn't contain any repositories directly, but instead
several subgroups. The only exception is a "wikistats" [1] repo that
I'll be asking Daniel Zahn to relocate elsewhere.
Child subgroup:
https://gitlab.wikimedia.org/repos/cloud/toolforge
Everything Toolforge, components for k8s, etc.
The only "official" repo here is
https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx and the
rest are placeholder/tests
Child subgroup:
https://gitlab.wikimedia.org/repos/cloud/cicd
See
https://lists.wikimedia.org/hyperkitty/list/cloud-admin@lists.wikimedia.org…
Child subgroup:
https://gitlab.wikimedia.org/repos/cloud/deb
I just created this the other day to store .deb packaging repos. The
first one is here:
https://gitlab.wikimedia.org/repos/cloud/deb/pkg-prometheus-openstack-expor…
I guess we could introduce other subgroups as we go, imagine for example:
* /repos/cloud/vps/
* /repos/cloud/paws/
* /repos/cloud/quarry/
* /repos/cloud/you_name_it/
Comments welcome.
regards.
[1] https://gitlab.wikimedia.org/repos/cloud/wikistats
--
Arturo Borrero Gonzalez
Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi there,
Today 2022-04-06 we're performing some network maintenance operations on
Cloud VPS that could affect all cloud egress/ingress traffic, including
Toolforge. The cuts, if noticeable, should last a few minutes at most.
Some operations were also conducted yesterday (without an email notice
like this one), and some unexpected hiccups occurred; hence today's email.
regards.
--
Arturo Borrero Gonzalez
Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi there,
I've been playing this morning with gitlab.w.o and CI, and here is my
proposal on how to work with it:
* have all gitlab-ci related stuff consolidated into a single repository
[0], which includes:
** generic gitlab-ci.yaml config files as required [1]
** Dockerfiles for the images used in the above gitlab-ci.yaml files [2]
* in each repo that requires CI (likely all of them), instead of having
to re-write the gitlab-ci.yaml file every time, "include" it from the
repo above [3]; there is a rough sketch of this below the list.
* due to upstream docker registry rate limits, we need to do some heavy
caching in our docker registry (docker-registry.tools.wmflabs.org),
which even involves scp'ing the base docker image from one's laptop
(something like [4]), because you can't even pull the base images from
docker.io on tools-docker-imagebuilder-01.
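To make the "include" part concrete, the consuming repo's CI file would
look roughly like this (the template file name below is a guess, since I
only have the truncated link above; adjust to whatever we actually name it):

cat > .gitlab-ci.yml <<'EOF'
# pull the shared pipeline definition from the central cicd repo
include:
  - project: 'repos/cloud/cicd/gitlab-ci'
    ref: main
    file: '/py3.9-tox.yml'   # hypothetical template name
EOF

And the manual image-caching dance is more or less the following
(hostnames as mentioned above; the python:3.9 image is just an example):

# on a laptop, where docker.io is reachable
docker pull docker.io/library/python:3.9
docker save docker.io/library/python:3.9 -o python39.tar
scp python39.tar tools-docker-imagebuilder-01:

# on tools-docker-imagebuilder-01
docker load -i python39.tar
docker tag docker.io/library/python:3.9 docker-registry.tools.wmflabs.org/python:3.9
docker push docker-registry.tools.wmflabs.org/python:3.9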
To see a live demo/example of this, here is a successful tox job for a
python 3.9 repository:
https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api/-/job…
I'd prefer to keep our own CI/CD stuff in a separate repo/docker
registry for now; I don't think there is a unified effort for this,
unlike in Gerrit.
I think this kind of CI stuff is one of the main missing pieces that was
previously preventing us from adopting gitlab for good.
PS: there are a bunch of things to automate here, like base image
maintenance and such. I'll wait to see if this proposed workflow is
something we're interested in.
regards.
[0] https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci
[1]
https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/py3.9-b…
[2]
https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/blob/main/py3.9-b…
[3]
https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api/-/blo…
[4]
https://stackoverflow.com/questions/23935141/how-to-copy-docker-images-from…
--
Arturo Borrero Gonzalez
Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
This meeting coalesced around a few major topics:
* Why not just bring-your-own-container?
** We turn out to have fairly different ideas about what user
experiences we want to support.
** General agreement that we could use more research and/or
documentation about current and future workflows (although some of that
already exists at
https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences)
** This question hinges on how sophisticated our users are or aren't.
** Komla suggests that many public cloud platforms provide simple
deployment flows that don't require users to understand container or k8s
details; there's general agreement that we would like to offer something
similar
** Andrew thinks that push-to-deploy should be our 'main priority' and
byo doesn't really address that.
* How/When will we kill off the grid engine?
** We tend to think of this as blocked by push-to-deploy, but perhaps we
should be open to other non-blocking options (e.g. the 'one big
container' migration path)
* What to do about the Stretch->Buster migration?
** Nicholas isn't convinced that we should migrate to Buster if we're
just going to kill the grid eventually anyway. Andrew and Arturo mostly
disagree.
** Probably the migration to Buster isn't a lot of hard mental work,
just building things and throwing pre-existing switches.
** Main blocker for this is allocating time and tasking someone with
doing the work
------ RAW ETHERPAD NOTES ------
== Toolforge next steps meeting 2021-12-14 ==
The approaching deadline (from an announcement email on 2021-09-27) is:
January 1st, 2022:
* Stretch alternatives will be available for tool migration in Toolforge
The proposed agenda follows.
=== Goals (Grid) ===
* Grid engine deprecation is blocked until users can customize container
builds. Buildpacks are intended to address this need.
* The blocker is providing equivalent runtime support for tools on the
Kubernetes cluster as the current grid engine cluster has. Kubernetes
containers are "thin" and tools will need the ability to add libraries
and binaries that are custom to them.
=== Goals (Buildpacks) ===
** Why are we doing this?
*** Allow users to customize k8s images; this makes it easier for users
to migrate off of gridengine
* Arturo asks: why not just bring-your-own-container?
** Bryan answers: because bring-your-own container means containers
without any toolforge integration (e.g. no ldap)
* What about putting that in a base layer?
** That's a build your own container approach, which is what buildpacks
is bringing
** But allowing build-your-own today is simpler than buildpacks for SREs
** Adding complexity for end users
* Building your own docker image adds more complexity for end users
** Buildpacks also limit what you can put in a docker container, so
potentially better security
* How does buildpacks improve security?
** If you let someone else build a docker image, it could run as a
different user and open security holes
** How could it be limited in k8s? The container runs as root on the host
** By building the container in buildpacks, we limit it
** k8s has full control over the runc runtime, so k8s could prevent user
spoofing
* Public clouds have bring-your-own-container, so it must be possible, right?
* Would like to see a list of prioritized user workflows
** What workflows are we enabling with buildpacks?
** Consider looking at
https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Our_audiences
* What's the long-term vision for TF?
** Push to git repo, and it "just works"
** The heroku workflow is built on buildpacks
* What's the concern / fear about buildpacks?
** Complexity, can we find a way to simplify things?
** Lack of adoption (what if we end up the only ones using it)
* Bringing your own container isn't a regression, but it's not a
replacement for existing workflows (aka, we can't kill the grid by
simply adding bring your own container)
** Why?
** Workflow and brainstate. Users will no longer understand how to run
workflows. Running a job is much simpler than building and maintaining a
container to run a job. Real risk of losing tools
* Can we assume complex tool authors are technically capable of building
a container?
** Complexity can be easily introduced in the grid; that doesn't mean
they could build and maintain a container
** `webservice --backend=gridengine start` supports PHP, Python, and
Perl fcgi out of the box.
* Google Cloud comparison -- the industry seems to be moving towards
containers, but users don't need to build a container or even know that
they are running in a container. Build a flask app, run a command, GCE
builds a docker container and runs it (see the sketch at the end of
this section)
** all you need is for the buildpack to 'detect' what runtimes you need
(it can be a file, or checking for a packages.lock or whatever)
* So David et al. -- do buildpacks seem complex?
** Buildpacks are easy-ish. Complexity is introduced by putting
buildpacks into Toolforge
** Tekton / buildpacks PoC is easy
* Build service
** admission controller, some custom resource definitions, hardware. Any
docker registry could be used? Harbor presented some issues in running
outside of k8s
* How much engineering effort to bring that in prod? Is it possible to
bring online this year?
** Depends on resources, and how org handles things
** Yes, if we work on it? :-)
* Priority wise, a push to deploy solution is the most important /
seamless thing we could work on
** Push to deploy toolforge, before grid deprecation even?
** Can't get rid of the grid until reasonable replacement
** Why?
** Because we need to support people today
** Actually need mixed runtime environment deployable on k8s
* For example, 1 giant container that contains everything from a grid
exec node
** So grid isn't dependent on push to deploy exactly
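(Sketch referenced above, purely for illustration: this is roughly what
the detect-and-build experience looks like with the upstream "pack" CLI;
the builder and tool names are just examples, not a proposal for what
Toolforge would actually run.)

# in a tool's source checkout; pack inspects the files present
# (requirements.txt, package.json, ...) to decide which buildpacks apply
pack build example-tool --builder paketobuildpacks/builder:base

# the result is a normal OCI image that can be run anywhere
docker run --rm example-tool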
=== Goals (Buster) ===
* Have a plan for Buster on grid engine
* Decide whether or not to have a plan for Buster on k8s (and,
optionally, have a plan)
* Decide what timeline adjustment is realistic; pick someone to
communicate this delay to the users
==== Grid Engine ====
* Why do we hate the grid?
** When jobs are being run, little isolation. Uses watcher spawned
processes + runtime hacks.
** No longer developed or supported by any upstreams
** In ~2018? we looked at modern "grids" that spawned things similarly,
but could be better managed (i.e., Slurm)
** At the time, decided k8s was the future, and decided that slurm or
similar wasn't a good idea
* Grid is important -- what could we do this year?
** Find someone to build a buster grid
** Make a mad dash at killing the grid asap.
* Giant container is unlocked now, previously limited to 2G
** 3.1G container, needed on each k8s exec node
** Only need 1 copy on each node; shared between jobs
** When building new containers, be wary of variants when deploying new
containers. Could be N x 3.1G
** Large containers aren't performant on k8s
* Buster migration
** Most of the pieces are in place
** Build out nodes, switchover
* Why the timeline?
** Organizational timeline
** Grid is sensitive to DNS changes
<discuss>
<decide>
<who>
==== k8s ====
<discuss>
<decide>
<who>
==== timeline ====
<discuss>
<decide>
<who>
* dcaro: I propose delaying any non-urgent decision until we finish with
the decision making/tech discussion email thread
** I would argue that the stretch deprecation is starting to become
urgent (although I don't have context about that email thread)
=== Current status, open projects ===
This may be the list of open projects:
* stretch-to-buster migration final push
** including but not limited to the grid
* grid engine deprecation timeline & plans
** draft here
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
* pending kubernetes upgrades (if any)
** yes, current clusters are at 1.20 and 1.23 is the latest release.
1.21 is simple (and I think it would be a good opportunity for someone
to learn the process) but 1.22 is complicated because it removes tons of
deprecated things
** 1.20 is still supported; the support policy is the last 4 releases
* toolforge build service (buildpacks)
** currently we don't really have any visibility into package versions
and available (security) upgrades in our images, or into deployed image
versions - can we improve this with buildpack images or otherwise?
https://phabricator.wikimedia.org/T291908
* toolforge jobs framework
=== Next steps, prioritization ===
* what to do next, and who
=== Long term future ===
* Share your ideas of how Toolforge should look in 5 years from now
Context:
"Galera on cloudcontrol1004 going out of sync"
https://phabricator.wikimedia.org/T302146
Galera (the database backend for OpenStack) has been very unstable ever
since I upgraded the cluster to Bullseye. This is probably an issue with
a buggy version of mariadb/galera.
I'm trying an experiment: mariadb is currently stopped on
cloudcontrol1004, and puppet disabled so it won't get restarted. I want
to see if that change (a two-node cluster and/or removing the
suspected-cursed cloudcontrol1004 from the cluster) causes things to
stop breaking.
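For the record, the change is roughly the following (standard
puppet/systemd commands; the galera status queries are just how I'm
keeping an eye on the remaining two nodes):

# on cloudcontrol1004
puppet agent --disable 'T302146: galera experiment, keep mariadb stopped'
systemctl stop mariadb

# on the other cloudcontrols, check the 2-node cluster stays healthy
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"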
I've done my best to downtime the alerts that will result from this,
but if one leaks through please don't respond by enabling puppet on
1004 -- we want to leave that db node switched off for now.
Other services on cloudcontrol1004 should continue to run normally.
Thanks!
-Andrew
Hi there,
Apparently linkwatcher is a key tool in fighting vandalism on the wikis,
but we cannot afford to host it on ToolsDB any longer.
I request your comments / suggestions.
Timeline:
==== 8< ====
== 2019-05-22 ==
Brooke detects a high storage usage problem caused by the linkwatcher
tool in ToolsDB [0].
[0] https://phabricator.wikimedia.org/T224154
== 2019-06-26 ==
Bryan warns [1] about the linkwatcher tool potentially becoming a
disaster condition for ToolsDB.
[1] https://phabricator.wikimedia.org/T224154#5284580
== 2019-07-19 ==
After some actions are taken to move linkwatcher to its own Cloud VPS
project, activity stops and the database data remains in ToolsDB [2].
[2] https://phabricator.wikimedia.org/T227377
== 2022-02-17 ==
ToolsDB stops replicating to its secondary due to depleted storage [3].
It is discovered [4] that linkwatcher uses 1/3 of total storage of
ToolsDB, this is ~1TB out of ~3TB. Data reported in other metrics [5]
may not be accurate.
While it may not be the primary cause of the problem, the "disaster
condition" mentioned by Bryan 3 years earlier is becoming more apparent.
[3] https://phabricator.wikimedia.org/T301951
[4] https://phabricator.wikimedia.org/T301967
[5] https://tool-db-usage.toolforge.org/
==== 8< ====
regards
--
Arturo Borrero Gonzalez
Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation