Hi!
I have opened a new task [1] to decide whether (or not) to set an upgrade cadence for our Ceph cluster [2].
Your input is more than welcome on the task itself or on this email thread.
There's no deadline, but if there's not a lot of discussion this could be decided right after the holidays.
You can find this one and other ongoing proposals here [3].
Thanks!
[1] https://phabricator.wikimedia.org/T325223
[2] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph
[3] https://phabricator.wikimedia.org/project/board/5263/
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Hi!
As we have been gathering and defining user stories for the Toolforge Build Service and Toolforge itself, I
have been thinking about the next steps for both of them (and their surroundings), and I wanted to share those
thoughts and have some discussion, to give a bit more direction to our work in those areas.
== TL;DR
Let's think, without constraints, about what we want Toolforge to become.
My opinion:
* Move towards full Platform as a Service
** this means users only interface with our platform
** this might mean offering k8s as a service on top of CloudVPS if needed
* Simple thin client
* Simple thin UI (for people that can't/don't want to use the client)
* API that supports both the above
== Long description
I think that this is already a somewhat popular idea, but to state it plainly: I would like Toolforge to be as easy to
use as DigitalOcean or Heroku, that is, a PaaS.
This means:
* No need for ssh
* Very simple cli (from the user's computer)
* Simple web UI (same capabilities as the cli, for anyone that can't install the cli)
This also means:
* No k8s as a service (discussed later)
* Detaching the users from the underlying implementation
I know that this might require lots of changes, and those are not easy, but let's focus on the features we want, not the
design underneath yet.
What I would like is to have some set of "components" that I can use and combine to create my tool:
Storage:
* Store structured data somewhere (db)
* Store unstructured data somewhere (storage/file-like?/s3?)
Compute:
* Something that runs periodically (cron-like)
* Something that runs once (one-off)
* Something that runs continuously (daemon)
Network:
* Create a public entry point for a web service
* Connect between my components
So, inspired by the DigitalOcean [1] and Heroku [2] CLIs, the toolforge CLI could be as simple as:
* toolforge run
* toolforge run-once
* toolforge run-every
* toolforge db
* toolforge storage
* toolforge expose-port (--public|--local)
Some side-commands could be:
* toolforge tool -> to manage tools themselves (create/add-maintainer/remove-maintainer/...)
* toolforge get-all -> to list all my components
* toolforge logs -> get the logs for a component
* toolforge shell -> start a shell inside a component container (similar to heroku bash), for debugging
* toolforge edit-config -> to allow doing all of the above through some kind of structured spec
This is not an exhaustive list, but it should cover most of the use cases.
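To make that concrete, here is a minimal sketch of a hypothetical session; the subcommands exist only as proposed above, and the flags and arguments are purely illustrative, not a settled interface:

    # hypothetical session; flags/arguments are illustrative only
    toolforge tool create mytool                   # create the tool itself
    toolforge db create                            # provision structured storage
    toolforge run-every "@hourly" -- ./update.sh   # cron-like component
    toolforge run -- ./webservice.py               # continuously-running component
    toolforge expose-port --public 8000            # public entry point
    toolforge get-all                              # list all my components
    toolforge logs webservice                      # fetch a component's logs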
You might be asking now: what about people who need some extra features from k8s?
For those, we can offer k8s as a service (through CloudVPS + terraform for example), so they have full control of their
k8s instances.
Note that I have tried to refrain from adding any implementation details yet, as I think that we should do the
exercise of deciding what we want without limiting ourselves by how we think it could be done.
The limitations will come later :)
== Some random stats for current k8s toolforge usage
Total number of namespaces:
3163
Of which, namespaces that are empty:
1496
That means that only 1667 have something in them. For those, the number of k8s webservices:
1276
Number of grid webservices:
307
Number of tools with cronjobs:
71
Number of tools with >1 cronjob:
47
Number of tools with >10 cronjobs:
6
Number of tools with manually defined resources:
51
I checked a few of those, and they could be covered by "continuous jobs" (that is, daemons), though I have not
reviewed all of them in detail.
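For the curious, numbers like these can be approximated from the cluster itself. A rough sketch, assuming admin kubectl access to the Toolforge cluster; note that the label selector for webservices below is a guess, not the label Toolforge actually applies:

    # Total number of namespaces
    kubectl get namespaces --no-headers | wc -l
    # Number of k8s webservices; the label selector is an assumption
    kubectl get deployments --all-namespaces -l toolforge=tool --no-headers | wc -l
    # Number of distinct namespaces that have at least one cronjob
    kubectl get cronjobs --all-namespaces --no-headers | awk '{print $1}' | sort -u | wc -l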
[1] https://docs.digitalocean.com/reference/doctl/reference/apps
[2] https://devcenter.heroku.com/categories/command-line
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Hi there!
On 2022-11-28 and 2022-11-29 some misleading emails were sent: you may
have received one (or more) about Puppet failures on your Cloud VPS
virtual machine.
Moreover, those emails were a bit contradictory, containing messages
like "No failed resources" and "No exceptions happened".
There was a problem in the way Puppet errors were calculated, which has
now been fixed [0].
This does not affect Toolforge.
sorry for the noise,
regards.
[0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/861805/
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi there,
Today, 2022-11-22 at about 12:25 UTC, as part of a routine operation, I
reimaged/reformatted a cloudvirt hypervisor without relocating all the
virtual machines first.
The data survived the reimage, but the 32 (!) affected virtual machines
were briefly unavailable and then hard-rebooted.
All virtual machines are now ACTIVE (up and running) from the OpenStack
point of view, but please let me know if you need assistance recovering
them in any way.
As of this writing we don't have any automation to ensure we only
reimage empty hypervisors, but we're working on it, to prevent this kind
of human error in the future.
regards. (and sorry!)
(!) Affected virtual machines are:
- ID: 78782628-4f9f-4263-84fc-06e767b3bfe1
Name: mx-wiki
- ID: 1fa9f0d9-42e8-4273-bdb1-a7d49998c13f
Name: synapse01
- ID: 2382fda0-e683-4d0c-95b6-bbbf323904d9
Name: canary1048-04
- ID: 4b570277-e51f-459d-bea2-394c5ad7bc92
Name: tools-sgeexec-10-16
- ID: 66529c1b-f3a3-4ff8-b30d-785f4f274965
Name: feature-store-test
- ID: e153f69a-46a0-458a-ab50-de3d86aa861b
Name: toolsbeta-test-k8s-worker-7
- ID: c3a2d1a9-f811-4da9-afba-3a113c8ff729
Name: wbregistry-02
- ID: 2b56c575-08a5-4def-87cb-bee5bd43e4f9
Name: prod
- ID: 141ac13c-f0fa-46d3-9d2a-cede8bc854c6
Name: devtools-puppetdb1001
- ID: fdb15c24-0b41-42d6-9c4a-82afd1d9dcb9
Name: tools-sgeweblight-10-31
- ID: 56e55a31-8d32-455e-b650-b7194e71d2fd
Name: runner-1023
- ID: cb4a87e4-264e-4c8f-8197-3efff54346de
Name: runner-1022
- ID: 5b6b5733-565d-456e-a4fc-85ce669d3fd2
Name: deployment-mdb02
- ID: 75dce76d-36ad-4f9e-85e9-8a11ff6744db
Name: wikibase-product-testing-2022
- ID: 868d3dca-3e5c-4089-89a9-2c7e756c3e31
Name: toolsbeta-cumin-1
- ID: 42ac6d8a-453a-4620-b4b7-9c97994c98fb
Name: integration-agent-docker-1030
- ID: 084da652-503d-49a7-9ffa-98a0cd5335fd
Name: toolsbeta-sgeexec-10-5
- ID: f098fe82-18b6-49a9-962d-9b8f1f989b14
Name: pcc-worker1001
- ID: 8eb272dc-8006-4e93-a966-5035809324d9
Name: deployment-mx03
- ID: e67d0e4c-e07c-4d9a-8ddb-cb0bc8efa388
Name: deployment-docker-api-gateway01
- ID: b958511a-10cb-4e62-bdbb-6da5013dd62f
Name: soweego
- ID: 62045cf9-59ed-44b9-a268-1c9f171b5aae
Name: tools-package-builder-04
- ID: 0127e905-f52e-4ed4-b60d-260102a8e625
Name: pontoon-lb-02
- ID: 827bf744-262a-458b-951d-f2e9a377e075
Name: toolsbeta-test-k8s-ingress-3
- ID: 3e6c31d7-b4db-4a5f-a610-a74d0013f631
Name: pki-test01
- ID: 8893ba32-fb5c-4567-a242-b6c676978b7d
Name: deployment-urldownloader03
- ID: f72e5b18-6376-4ccd-9e59-64447759e53f
Name: deployment-deploy03
- ID: 006dea0a-a1eb-4de3-bf45-1a071ad87152
Name: kafka-test-cloud-2
- ID: e05220d7-8ca1-4d9f-a933-01a843286ea8
Name: toolsbeta-docker-imagebuilder-01
- ID: 416f445a-cad4-45c2-b32e-f17100f93eac
Name: cloud-puppetmaster-05
- ID: 4e492051-25a3-4442-b8b9-1959f42917fe
Name: tools-k8s-worker-76
- ID: df18863a-2da7-4951-aa69-936b3d889592
Name: deployment-docker-cpjobqueue01
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
I think we could start monitoring prometheus-node-exporter on all Cloud
VPS VMs on all projects via the Prometheus instance in metricsinfra. The
required firewall rules are now in place (thanks to Andrew in T288108),
and I've written the required patches to
cloud/metricsinfra/prometheus-manager and to the Puppet repo:
https://gerrit.wikimedia.org/r/c/cloud/metricsinfra/prometheus-manager/+/85…
https://gerrit.wikimedia.org/r/c/operations/puppet/+/856917/
The main effect this will have is that we (and project admins, of
course) will have basic metrics (think CPU, disk, RAM, so on) for all
instances in all projects. Currently these wouldn't send any alerts
unless manually configured by a metricsinfra admin.
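For context, on the Prometheus side this boils down to a per-project scrape job against node-exporter's default port (9100). A minimal sketch of what such configuration could look like; the job name and target instance are illustrative, not the actual prometheus-manager output:

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets:
              # node-exporter listens on port 9100 by default;
              # the instance name here is made up for illustration
              - 'example-instance.example-project.eqiad1.wikimedia.cloud:9100'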
Please let me know if you have any questions or concerns, otherwise I'd
like to move forward in the next few days.
Taavi
Hi there,
Toolforge is a complex service. There are many moving parts and there
are always several people working on different pieces of it.
We have been holding informal Toolforge-specific meetings from time to
time, to unblock some decisions or to get everyone on the same page.
The proposal is to create a monthly 1h Toolforge engineering-focused
meeting called "Toolforge council".
This meeting would be open in nature, and would include:
* The WMCS/TE team
* Toolforge community root group members [0]
* Other interested parties, who can be invited if required
The notes and results of the meeting will be published somewhere on
Wikitech, and perhaps on this very mailing list.
The next two meetings of this kind will be:
* 2022-11-08 at 15:00 UTC
* 2022-12-13 at 15:00 UTC
For these next two, I will facilitate/moderate them, as well as
collect/share some agenda points beforehand.
I would like to avoid formalizing any other protocols regarding the
meeting beyond what is contained in this email. This is already an
evolution of the informal approach we have been using. Let's see how it
evolves organically.
Comments welcome (including naming hehe).
[0]
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#What_makes_a_roo…
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi all!
I'm trying to gather which Python versions need to be supported for running cookbooks. I would appreciate it if you
could reply to this email telling me which version you would be running cookbooks with (replying directly to me is fine, to avoid spamming others ;) ).
Thanks!
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Continuing from my post on cloud@...
On Thu, Oct 6, 2022 at 6:21 PM Bryan Davis <bd808(a)wikimedia.org> wrote:
>
> On Thu, Oct 6, 2022 at 5:39 AM Taavi Väänänen <hi(a)taavi.wtf> wrote:
> >
> > In general, I feel that over the last few months,
> > quite a lot of planning and progress reporting has moved from our
> > various public channels (most notably Phabricator and -cloud-admin on
> > IRC) to private ones. I don't particularly like this trend.
>
> I did a thing in my late afternoon yesterday that may have aggravated
> Taavi's feelings of being left out of decision loops.
>
> I made a decision without consulting any other Toolforge admins to add
> about 300MiB of fonts to the php7.4 Docker image available for use on
> Toolforge [0]. This decision reversed my prior blocking of this exact
> same request in 2019 [1]. It also goes against at least as many years
> of the Toolforge admins telling the Toolforge member community that we
> do not "bloat" the Kubernetes containers with specialty features for a
> small number of use cases. This reversal will complicate future
> decisions on such issues by introducing this easily seen counter
> example. I acted with good intent in the moment, but I did not act
> with good judgement nor consideration of my partners in maintaining
> the Toolforge infrastructure. For that I am truly sorry.
>
> I would also like to apologize for treating what I was doing as
> "urgent" when it could have easily waited for a discussion with others
> either in code review or in other forums. This false urgency was
> counter to what I know to be the best way to treat technical decisions
> and it was disrespectful of my co-admins in the Toolforge environment.
>
> I would also like to have a conversation among the Toolforge admins
> about how to best deal with this decision going forward. That
> conversation is probably better had on Phabricator or the cloud-admin
> mailing list than here, but it should happen and it should result in
> either reverting the change that I made or jointly creating updated
> guidelines for what is and is not acceptable in the shared Kubernetes
> containers while we await better methods of managing per-tool feature
> differences.
>
> [0]: https://phabricator.wikimedia.org/T310435#8288848
> [1]: https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+…
For the fonts themselves, should we:
* Revert the change and tell svgtranslate to move back to the grid?
* Propagate the change outward by making the same/similar change to
all php images?
* Propagate the change outward by making the same/similar change to
all base images?
* Let it be.
For the bigger picture of breaking our long-held stance on "bloat", I
would like to hear suggestions from y'all. If the decision is to revert
the fonts, then maybe there is nothing to talk about here. If the fonts
stay, then I think we need to either document this as a rogue action
that has been allowed to stand, which should not set a precedent for
the future, or come up with a rubric for what is allowed and why.
I am also open to hearing from anyone on or off list who feels that I
need to make additional amends to the Toolforge admins, the Toolforge
user community, or any particular individuals. I really didn't mean to
make a mess, but I did and I would like to work towards correcting
that as much as possible.
Bryan
PS I will be out of office until 2022-10-11, but I will try to check
in on this thread in the intervening days.
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
Hi there,
just today I introduced the $::wmcs_project var to replace $::labsproject
https://gerrit.wikimedia.org/r/c/operations/puppet/+/849050
Please use the modern one.
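In practice this is a one-line change in any manifest that still references the old fact; a sketch (the $project variable name is hypothetical):

    # before (deprecated):
    $project = $::labsproject
    # after (modern replacement):
    $project = $::wmcs_project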
Perhaps I should try to nerd-snipe someone to see if we can have the
linter reject new patches with the old variable.
regards.
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi cloud-admin@,
The recent cloud@ thread made me realize that I should probably keep
everyone else more up to date on the infrastructure-level projects I'm
working on by myself. So I've tried to summarize the major recent and
upcoming changes below, in semi-random order.
Please let me know if you find this useful or interesting (or if you
don't, it helps to know that too). Questions and comments are also welcome.
Terraform
I sent Puppet patches[0] to enable application credential authentication
in Keystone to let arbitrary clients speak to the OpenStack APIs. I
believe Andrew is working on the firewall rules and related HAProxy
config to open up the APIs to the public as a part of the
Cumin/Spicerack work going on at the moment.
I tagged the initial version of the custom terraform-cloudvps
Terraform provider.[1] The provider is designed to supplement the
'official' OpenStack provider and currently lets you interact with the
web proxy API using the new go-cloudvps library[2], with Puppet ENC
support next up on my Terraform TODO list.
There's also a Puppet patch[3] pending to configure a self-hosted
Terraform registry on terraform.wmcloud.org. It's cherry-picked to the
project puppet master, but having it actually merged would be nice.
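To illustrate how this could look from a user's point of view, here is a hypothetical Terraform snippet; the provider source address and the resource name/attributes are assumptions, not the provider's confirmed schema:

    terraform {
      required_providers {
        cloudvps = {
          # assumed source address on the self-hosted registry; not final
          source = "terraform.wmcloud.org/cloud-vps/cloudvps"
        }
      }
    }

    # hypothetical resource managing a Cloud VPS web proxy entry
    resource "cloudvps_web_proxy" "example" {
      hostname = "mytool"
      backend  = "http://172.16.0.42:8000"
    }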
[0]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/840121
[1]: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/terraform-cloudvps
[2]: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/go-cloudvps
[3]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/834344
CloudVPS web proxy
Planning on doing some work to make the proxy service more reliable in
case of node failure. Also planned is moving the current SQLite database
to the cloudinfra MariaDB cluster for reliability / easier failover
purposes. There are a few Puppet patches prepping for this pending
review, starting from [4].
[4]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/831041
Toolforge
Sent a few patches to the jobs-framework-* repositories. Planning to do
a bit more cleanup here, to hopefully make the grid migration easier.
I'd like to introduce a new k8s utility, kube-container-updater[5], to
automatically restart long-running containers that are running outdated
images.
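As a point of reference, the manual equivalent of what such a utility might automate is roughly the following (namespace and deployment name are illustrative, and the re-pull assumes imagePullPolicy: Always):

    # restart a workload so it picks up a newer image
    kubectl --namespace tool-example rollout restart deployment/example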
Upgrading to Kubernetes 1.22 is only blocked on dealing with certificate
generation for the custom webhooks[6]. For this, I'd like to get
feedback on the approach (continue to manually sign certificates, or
introduce cert-manager to automate that). Looking further ahead at the
k8s versions: 1.23 will be fairly simple, I think, and 1.24 will require
migrating the cluster from Docker to containerd, which I'd like to pair
with a bullseye upgrade.
Once we have an object storage service I'd like to look a bit more into
providing a logging solution that doesn't use NFS.
[5]:
https://gerrit.wikimedia.org/r/c/cloud/toolforge/kube-container-updater/+/8…
[6]: https://phabricator.wikimedia.org/T286856
metricsinfra
No recent development here. I think we could roll out Prometheus
scraping to all projects and instances with the current infra, but for
that someone would need to sort out how to deal with security groups
with the pull model Prometheus uses. Some discussion about this is in
Phabricator[7].
The next item on the metricsinfra roadmap is building an API to
let projects manage their scraping rules and alerts. I'd like to
integrate that with Terraform at some point.
[7]: https://phabricator.wikimedia.org/T288108
Puppet ENC service
Planning to do some work[8] on the ENC API service, mostly to make it
work with Terraform. Most notably the Git integration will be moved from
the Horizon dashboard to the API service itself.
[8]: https://phabricator.wikimedia.org/T317478
ToolsDB
No recent developments here either. :(