I will be upgrading the cloud-vps OpenStack install on Thursday,
beginning around 16:00 UTC. Here's what to expect:
- Intermittent Horizon and API downtime (maybe an hour or two total)
- Inability to schedule new VMs (also for an hour or two)
Toolforge users will be unaffected by this outage. Existing, running
services and VMs on cloud-vps should also be unaffected.
In case you want to follow along at home, this is tracked as
https://phabricator.wikimedia.org/T356287
-Andrew + the WMCS team
I've swapped the secondary dev.toolforge.org bastion to a new server
running Debian 12. As usual, the new SSH fingerprints have been
published on Wikitech[0].
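If you want to double-check before trusting the new host key, you can
compare the fingerprint yourself with standard OpenSSH tooling (a
sketch; swap the key type if you use RSA or ECDSA):

    $ ssh-keyscan -t ed25519 dev.toolforge.org 2>/dev/null | ssh-keygen -lf -

The printed fingerprint should match one of the values on the Wikitech
page above.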
The new bastion does not have the full set of packages that were
installed to support Grid Engine usage. If a package you would find
useful is missing from the new bastion, please file a new Phabricator
task in the Toolforge project[1].
If no major issues are found, I will also swap the main
login.toolforge.org bastion to a new server in a few days. I'll send a
separate announcement when that happens.
[0]: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/dev.toolforge.org
[1]: https://phabricator.wikimedia.org/tag/toolforge/
Taavi
--
Taavi Väänänen (he/him)
Site Reliability Engineer, Cloud Services
Wikimedia Foundation
TL;DR: If you start to notice new or noisy puppet failures on your VMs,
please notify me directly or open a phab ticket and assign it to me
(Andrew).
==
What's happening:
Over the last few weeks I've been upgrading cloud-vps puppet servers to
newer builds that support the latest version of the puppet config
language, version 7. That's done in almost all cases. There are a few
project-local puppetmasters that I've been nervous about messing with
directly; for those I've opened phabricator tickets and assigned them
to project admins. For clarity, I've been using 'puppetserver'
terminology for new servers, whereas older servers were generally called
'puppetmasters.' [0]
Now that most servers are upgraded, it's time for me to flip the setting
that causes them to actually use the version 7 parser and compiler. In
almost all cases this will be backwards-compatible with the existing
catalogs, but we may turn up a few edge cases that require repair.
What you need to do:
If you have one of those phab tickets about puppetservers open for your
project, please respond on the ticket so I know you're there and what
your plan is.
All other users: please reach out to me if you start seeing new or
surprising puppet failures, and I'll help sort out the transition.
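If you want to check a VM proactively, a manual agent run will surface
any compilation problems right away (a sketch using the stock agent
CLI; cloud-vps images may also ship their own wrapper scripts):

    $ sudo puppet agent --test

A healthy run ends with an 'Applied catalog' notice; parser or catalog
compilation errors from the new puppetservers will show up directly in
the output.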
-Andrew
[0] https://wikitech.wikimedia.org/wiki/Help:Project_puppetserver
Hi all!
This is to let you know that Toolforge continuous jobs now support
health-checks!
To use it, provide `--health-check-script ./script.sh` when creating
your job. You can also provide the script as a string, like
`--health-check-script "cat /etc/os-release"`. Toolforge will
periodically attempt to execute your health-check script inside your
running job and will restart your job if the script completes with an
exit code of 1.
Note: if you use a script file for the health-check, do not forget to
make the file executable (chmod u+x script.sh). If Toolforge can't
execute your health-check script, your job will never start.
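Putting it together, creating a continuous job with a health check
might look something like this (a sketch; the job name, image, command,
and script are placeholders):

    $ toolforge jobs run my-bot \
        --continuous \
        --image python3.11 \
        --command "python3 bot.py" \
        --health-check-script ./healthcheck.sh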
As a reminder, you can find this and other smaller user-facing updates
about Toolforge platform features here:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Changelog
Original task: https://phabricator.wikimedia.org/T335592
--
Ndibe Raymond Olisaemeka
Software Engineer - Technical Engagement
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi,
Toolforge's Harbor instance (image registry) will be down briefly for a
version upgrade from 2.9.0 to 2.10.1 tomorrow, Thursday 4 April, at
9:00 UTC.
https://phabricator.wikimedia.org/T354507
This should not affect any tools that are not using the new build service,
nor any tools that are already running.
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
If you are using the build service, you will not be able to run any new
builds, or start a job or a webservice from an image built with the
build service, while Harbor is down. The outage is expected to last a
few minutes.
We will send an update before starting maintenance work, and once
everything is back up and running.
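If a build happens to be interrupted by the maintenance, re-running it
afterwards should be all that's needed, e.g. (a sketch; the repository
URL is a placeholder):

    $ toolforge build start https://gitlab.wikimedia.org/toolforge-repos/my-tool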
Cheers,
--
Slavina Stefanova (she/her)
Software Engineer | Developer Experience
Wikimedia Foundation
Hello!
In order to conserve resources and prevent botnet hijacking, cloud-vps
users have a few maintenance responsibilities. This spring two of these
duties have come due: an easy one and a hard one. TL;DR: visit
https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2024_Purge, claim
your projects, and replace any hosts still running Debian Buster.
-- #1: Claim your projects --
This one is easy. Please visit the following wiki page and make a small
edit in your project(s) section, indicating whether you are or aren't
still using your project:
https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2024_Purge
This serves several purposes. It allows us to identify and shut down
abandoned or no-longer-useful projects, it provides us with updated
info about who cares about a given project (often useful for future
contact purposes), and it increases visibility into projects that are
used but unmaintained.
Regarding that last item: if you know that you depend on a project but
are not an admin or member of that project, please make a note of that
on the above page as well!
-- #2: Replace Debian Buster --
This one may require some work. Long-term support for the Debian Buster
OS release is quickly running out (it ends June 30), so VMs running
Buster need to be replaced with hosts running a newer Debian version.
You may or may not be responsible for Buster instances; you can see a
breakdown of remaining Buster hosts on either of these pages:
https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2024_Purge (you
should be visiting that page anyway, because of item 1)
https://os-deprecation.toolforge.org/
More details about this process can be found here:
https://wikitech.wikimedia.org/wiki/News/Buster_deprecation
Typically in-place upgrades of VMs don't work all that well, so my
advice is to start fresh with a new server running Bookworm and to
migrate workloads to the new host. I've found Cinder volumes to be a big
help in this process; once all of your persistent data and config is in
a detachable volume it's fairly straightforward to move and will make
future upgrades that much easier.
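If you haven't used Cinder volumes before, the rough flow after
creating and attaching a volume in Horizon looks something like this (a
sketch assuming the volume shows up as /dev/sdb; run lsblk to confirm,
and note that the mkfs step erases the volume, so it's for first use
only):

    $ lsblk                          # find the newly attached device
    $ sudo mkfs.ext4 /dev/sdb        # first use only: create a filesystem
    $ sudo mkdir -p /srv/data
    $ sudo mount /dev/sdb /srv/data
    $ echo '/dev/sdb /srv/data ext4 defaults 0 2' | sudo tee -a /etc/fstab

After that, the volume can be detached and re-attached to a replacement
VM without touching the data.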
WMCS staff will be standing by to help with any quota changes you might
need for this move; you can open a quota request ticket at
https://phabricator.wikimedia.org/project/view/2880/ -- and, as always,
we'll do our best to support you on IRC and on the cloud mailing list.
Thank you for your support and attention!
-Andrew + the WMCS team
Quarry will move to k8s on Monday 2024-04-01. Part of this will involve
exporting and importing the database, as well as syncing the NFS. As a
result, queries run during the cutover window may be lost. As always,
don't rely on Quarry to save your queries: keep any important queries
local to your system and copy them into Quarry when you run them.
Thank you
--
*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello!
Today we are upgrading Toolforge Kubernetes to version 1.24.
We are not expecting any outage, but some jobs and webservices may be
automagically restarted as they get scheduled on different worker nodes.
Please report any disruption that you may observe.
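If you want to check whether your tool was rescheduled, the pod list
will show it (a sketch; run this from your tool account):

    $ kubectl get pods

Pods that were moved to a new worker node will show up with a fresh,
low AGE value.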
This is tracked in phabricator: https://phabricator.wikimedia.org/T307651
Regards.
As of 2024-03-14T11:02 UTC the Toolforge Grid Engine service has been
shut down.[0][1]
This shutdown is the culmination of a final migration process from
Grid Engine to Kubernetes that started in late 2022.[2] Arturo
wrote a blog post in 2022 that gives a detailed explanation of why we
chose to take on the final shutdown project at that time.[3] The roots
of this change go back much further, however, to at least August 2015,
when Yuvi Panda posted to the labs-l list about looking for more
modern alternatives to the Grid Engine platform.[4]
Some tools have been lost and a few technical volunteers have been
upset as many of us have striven to meet a vision of a more secure,
performant, and maintainable platform for running the many critical
tools hosted by the Toolforge project. I am deeply sorry to each of
you who have been frustrated by this change, but today I stand to
celebrate the collective work and accomplishment of the many humans
who have helped imagine, design, implement, test, document, maintain,
and use the Kubernetes deployment and support systems in Toolforge.
Thank you to the past and present members of the Wikimedia Cloud
Services team. Thank you to the past and present technical volunteers
acting as Toolforge admins. Thank you to the many, many Toolforge tool
maintainers who use the platform, ask for new capabilities, and help
each other make ever better software for the Wikimedia movement. Thank
you to the folks who will keep moving the Toolforge project and
other technical spaces in the Wikimedia movement forward for many,
many years to come.
[0]: https://sal.toolforge.org/log/DrOgPI4BGiVuUzOd9I1b
[1]: https://wikitech.wikimedia.org/wiki/Obsolete:Toolforge/Grid
[2]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[3]: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/
[4]: https://lists.wikimedia.org/pipermail/labs-l/2015-August/003955.html
Bryan, on behalf of the Toolforge administrators
--
Bryan Davis Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
Hello all,
We are on the final stretch of the grid engine deprecation process[0],
which means that the grid will be shutting down on Thursday, the 14th
of March. You can find a reminder of the full timeline here[1].
There are about 30 tools still running on the grid. If yours is one of
the few left to migrate, kindly ensure it is migrated before the 14th,
or reach out[2] to the team if you are facing any challenges or need
some assistance.
We have also reached out on phabricator and via email to the remaining
maintainers that still have their tools running on the grid to see if we
can help ease the migration or see if there are any blocking issues.
If you have a tool that is still on the grid and you cannot meet the
above deadline, kindly reach out via the tool migration phabricator
ticket or our support channels[2]. Note that this is a hard deadline
and no extensions will be granted, but we might be able to help you
with the transition.
We really appreciate all the effort and feedback given on the new
platform; it will help us improve our service and reduce the long-term
maintenance burden for tool maintainers and Toolforge admins alike.
[0]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation
[1]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[2]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services