PAWS will be upgrading to k8s 1.22 on 2023-01-31.
If you are running a workload at that time, it will need to be restarted.
--
*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi there,
The Toolforge jobs framework just got upgraded with a few new features:
* support for custom logs
* support for job failure retry policy
* new behavior with job image listing
* some initial validation of YAML files
The documentation should be mostly up-to-date in wikitech:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
You can stop reading here unless you want more details :-)
The custom log files feature will allow you to do things like:
* using a custom directory to store log files
* merging stdout/stderr logs together into a single file
* ignoring one of the two log streams
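To make the three options concrete, here is a minimal Python sketch of what
they mean for a process's output streams. This is not how the jobs framework
is implemented; the names run_with_logs, merge, ignore_stderr, job.out and
job.err are hypothetical and only used for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_logs(command, log_dir, merge=False, ignore_stderr=False):
    """Run `command`, writing its output under a custom log directory.

    Hypothetical helper for illustration only, not the Toolforge code.
    Returns the exit code of the command.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    with open(log_dir / "job.out", "w") as out:
        if merge:
            # Send stderr to the same file as stdout.
            err = subprocess.STDOUT
        elif ignore_stderr:
            # Discard one of the two streams.
            err = subprocess.DEVNULL
        else:
            # Default: one file per stream.
            err = open(log_dir / "job.err", "w")
        try:
            return subprocess.run(command, stdout=out, stderr=err).returncode
        finally:
            # Only the real file object needs closing.
            if hasattr(err, "close"):
                err.close()


def demo():
    """Run a child process with merged logs and return the log content."""
    child = [
        sys.executable,
        "-c",
        "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        run_with_logs(child, Path(tmp) / "logs", merge=True)
        return (Path(tmp) / "logs" / "job.out").read_text()
```

Calling demo() returns the content of the single merged file, which holds
the lines from both streams (their relative order depends on buffering).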
The job retry policy lets you instruct the computing engine to restart jobs
that failed, up to 5 times.
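The semantics of a bounded retry policy can be sketched in a few lines of
Python. This is an assumption-laden illustration, not the engine's actual
logic: run_with_retry and MAX_RETRIES are hypothetical names, and the job is
modelled as a plain callable that returns an exit code.

```python
MAX_RETRIES = 5  # upper bound mentioned in the announcement


def run_with_retry(job, retries):
    """Call `job()`; if it fails, restart it up to `retries` more times.

    `job` is any callable returning an exit code (0 means success).
    Returns a tuple (last_exit_code, number_of_attempts).
    Hypothetical sketch, not the Toolforge implementation.
    """
    if not 0 <= retries <= MAX_RETRIES:
        raise ValueError(f"retries must be between 0 and {MAX_RETRIES}")
    attempts = 0
    while True:
        attempts += 1
        code = job()
        # Stop on success, or once the initial run plus all retries are used.
        if code == 0 or attempts > retries:
            return code, attempts
```

With retries set to 2, a job that always fails is attempted three times in
total: the initial run plus two restarts.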
Job images are now listed in a different format, and deprecated images are
hidden by default, to encourage usage of newer ones.
Regarding the YAML validation, the toolforge-jobs utility will now emit a
warning if some key is unknown. We plan to make this more robust in the future,
also providing a schema file.
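As a rough idea of what "warn on unknown keys" means, here is a small Python
sketch. The set of known keys below is deliberately incomplete and is an
assumption made for the example; validate_jobs is a hypothetical name, and
the real check lives in the toolforge-jobs utility.

```python
import sys

# Assumed, incomplete set of recognised keys, for illustration only.
KNOWN_KEYS = {"name", "command", "image"}


def validate_jobs(jobs):
    """Warn about unknown keys in a list of job definitions.

    `jobs` is a list of dicts, as obtained from loading a YAML file.
    Unknown keys produce a warning on stderr but are not an error.
    Returns the list of warning messages.
    """
    messages = []
    for index, job in enumerate(jobs):
        for key in sorted(set(job) - KNOWN_KEYS):
            message = f"warning: job #{index}: unknown key '{key}'"
            print(message, file=sys.stderr)
            messages.append(message)
    return messages
```

A misspelled key such as 'comand' would be reported this way instead of
being silently ignored.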
We don't usually announce upgrades, but this one in particular contained
much-awaited features. This is the result of hard work by several folks, in
particular Taavi (community member) and Raymond (WMF contractor).
Happy `toolforging`. Regards.
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
I will be upgrading the cloud-vps openstack install on Monday afternoon
my time (beginning around 18:00 UTC). Here's what to expect:
- Intermittent Horizon and API downtime (maybe an hour or two total)
- Inability to schedule new VMs (also for an hour or two)
- Some mild Horizon dashboard changes, as I'll also be upgrading the
dashboards to version 'Zed'.
Toolforge users will be unaffected by this outage. Existing, running
services and VMs on cloud-vps should also be unaffected.
-Andrew + the WMCS team
Hello cloud-vps users!
It's time for our annual cleanup of unused projects and resources. Every
year or so the Cloud Services team tries to identify and clean up unused
projects and VMs. We do this via an opt-in process: anyone can mark a
project as 'in use,' and that project will be preserved for another year.
I've created a wiki page that lists all existing projects, here:
https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2022_Purge
If you are a VPS user, please visit that page and mark any projects that
you use as {{Used}}. Note that it's not necessary for you to be a
project admin to mark something -- if you know that you're currently
using a resource and want to keep using it, go ahead and mark it
accordingly. If you /are/ a project admin, please take a moment to mark
which VMs are or aren't used in your projects.
When February arrives, I will shut down and begin the process of
reclaiming resources from unused projects.
If you think you use a VPS project but aren't sure which, I encourage
you to poke around on https://tools.wmflabs.org/openstack-browser/ to
see what looks familiar. Worst case, just email
cloud(a)lists.wikimedia.org with a description of your use case and we'll
sort it out there.
Exclusive toolforge users are free to ignore this email.
Thank you!
-Andrew and the WMCS team
Hi there,
the Toolforge jobs service [0] (the one you would use via the `toolforge-jobs`
command line interface) will have a brief maintenance today 2023-01-10 @ 11:30
UTC (in about 15 minutes).
We need to restart the API service and it will be down for a couple of minutes
(perhaps even less).
During that time, using the toolforge-jobs command line interface will most
likely fail.
regards.
[0] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
On Tuesday 2023-01-17 PAWS will be moving k8s clusters.
As a result any running workloads or active sessions will stop and need to
be restarted.
https://phabricator.wikimedia.org/T326554
--
*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Due to Foundation holidays + personal time off, WMCS staff will be hard
to reach during the end of December and beginning of January.
We /may/ be reachable during serious outages or emergencies, but I
recommend a light touch on any existing, working projects. For those of
you who also have time off, I hope you have an enjoyable break, be it at
the keyboard or away from it.
-Andrew + WMCS staff
While troubleshooting an infrastructure issue I just now accidentally
triggered a reboot of a few VMs, including the primary host of toolsdb.
If you see a little storm of alerts in your tool logs about timeouts and
disconnections, that was what that was.
Everything should be back to normal now!
-Andrew
Hi there!
On 2022-11-28 and 2022-11-29 some misleading emails were sent: you may
have received one (or more) emails about puppet failures on your
Cloud VPS virtual machine.
Moreover, such emails were a bit contradictory, with messages like
"No failed resources", and "No exceptions happened".
There was a problem in the way the puppet errors were calculated that
has now been fixed [0].
This does not affect Toolforge.
sorry for the noise,
regards.
[0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/861805/
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi there,
Today 2022-11-22 at about 12:25 UTC, as part of a routine operation I
reimaged/reformatted a cloudvirt hypervisor without relocating all the
virtual machines first.
The data survived the reimage, but the 32 (!) affected virtual machines
were briefly unavailable and then hard-rebooted.
All virtual machines are now ACTIVE (up and running) from the openstack
point of view, but please, let me know if you need assistance recovering
them in any way.
As of this writing we don't have any automation to ensure we only
reimage empty hypervisors, but we're working on it, to prevent this kind
of human error in the future.
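For illustration, the kind of guard described here could look like the
following Python sketch. Everything in it is hypothetical (the Hypervisor
class, ensure_empty, reimage, and the do_reimage callback); it does not
talk to OpenStack and is not the automation being worked on.

```python
from dataclasses import dataclass, field


@dataclass
class Hypervisor:
    """Hypothetical stand-in for a hypervisor and the VMs it hosts."""
    name: str
    vms: list = field(default_factory=list)


class HypervisorNotEmpty(Exception):
    """Raised when a reimage is requested on a host that still has VMs."""


def ensure_empty(hypervisor):
    """Refuse to continue if any VM is still scheduled on the hypervisor."""
    if hypervisor.vms:
        raise HypervisorNotEmpty(
            f"{hypervisor.name} still hosts {len(hypervisor.vms)} VM(s): "
            + ", ".join(hypervisor.vms)
        )


def reimage(hypervisor, do_reimage):
    """Run the `do_reimage` callback only after the emptiness check passes."""
    ensure_empty(hypervisor)
    do_reimage(hypervisor.name)
```

The point of the design is that the destructive step is only reachable
through the check, so forgetting to relocate VMs fails loudly instead of
rebooting them.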
regards. (and sorry!)
(!) Affected virtual machines are:
- ID: 78782628-4f9f-4263-84fc-06e767b3bfe1
  Name: mx-wiki
- ID: 1fa9f0d9-42e8-4273-bdb1-a7d49998c13f
  Name: synapse01
- ID: 2382fda0-e683-4d0c-95b6-bbbf323904d9
  Name: canary1048-04
- ID: 4b570277-e51f-459d-bea2-394c5ad7bc92
  Name: tools-sgeexec-10-16
- ID: 66529c1b-f3a3-4ff8-b30d-785f4f274965
  Name: feature-store-test
- ID: e153f69a-46a0-458a-ab50-de3d86aa861b
  Name: toolsbeta-test-k8s-worker-7
- ID: c3a2d1a9-f811-4da9-afba-3a113c8ff729
  Name: wbregistry-02
- ID: 2b56c575-08a5-4def-87cb-bee5bd43e4f9
  Name: prod
- ID: 141ac13c-f0fa-46d3-9d2a-cede8bc854c6
  Name: devtools-puppetdb1001
- ID: fdb15c24-0b41-42d6-9c4a-82afd1d9dcb9
  Name: tools-sgeweblight-10-31
- ID: 56e55a31-8d32-455e-b650-b7194e71d2fd
  Name: runner-1023
- ID: cb4a87e4-264e-4c8f-8197-3efff54346de
  Name: runner-1022
- ID: 5b6b5733-565d-456e-a4fc-85ce669d3fd2
  Name: deployment-mdb02
- ID: 75dce76d-36ad-4f9e-85e9-8a11ff6744db
  Name: wikibase-product-testing-2022
- ID: 868d3dca-3e5c-4089-89a9-2c7e756c3e31
  Name: toolsbeta-cumin-1
- ID: 42ac6d8a-453a-4620-b4b7-9c97994c98fb
  Name: integration-agent-docker-1030
- ID: 084da652-503d-49a7-9ffa-98a0cd5335fd
  Name: toolsbeta-sgeexec-10-5
- ID: f098fe82-18b6-49a9-962d-9b8f1f989b14
  Name: pcc-worker1001
- ID: 8eb272dc-8006-4e93-a966-5035809324d9
  Name: deployment-mx03
- ID: e67d0e4c-e07c-4d9a-8ddb-cb0bc8efa388
  Name: deployment-docker-api-gateway01
- ID: b958511a-10cb-4e62-bdbb-6da5013dd62f
  Name: soweego
- ID: 62045cf9-59ed-44b9-a268-1c9f171b5aae
  Name: tools-package-builder-04
- ID: 0127e905-f52e-4ed4-b60d-260102a8e625
  Name: pontoon-lb-02
- ID: 827bf744-262a-458b-951d-f2e9a377e075
  Name: toolsbeta-test-k8s-ingress-3
- ID: 3e6c31d7-b4db-4a5f-a610-a74d0013f631
  Name: pki-test01
- ID: 8893ba32-fb5c-4567-a242-b6c676978b7d
  Name: deployment-urldownloader03
- ID: f72e5b18-6376-4ccd-9e59-64447759e53f
  Name: deployment-deploy03
- ID: 006dea0a-a1eb-4de3-bf45-1a071ad87152
  Name: kafka-test-cloud-2
- ID: e05220d7-8ca1-4d9f-a933-01a843286ea8
  Name: toolsbeta-docker-imagebuilder-01
- ID: 416f445a-cad4-45c2-b32e-f17100f93eac
  Name: cloud-puppetmaster-05
- ID: 4e492051-25a3-4442-b8b9-1959f42917fe
  Name: tools-k8s-worker-76
- ID: df18863a-2da7-4951-aa69-936b3d889592
  Name: deployment-docker-cpjobqueue01
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation