Toolforge just now suffered a partial grid-engine outage. All grid
services should be back to normal as of this email; some k8s services
may misbehave for the next hour or two.
NFS misbehavior resulted in grid control mechanisms timing out, which
meant that no new jobs could be scheduled for the last 90 minutes or so.
We've rebooted the NFS server, which has resolved the primary issues;
however, rebooting NFS is itself disruptive and may have caused other
jobs (either on the grid or in k8s) to fail.
We're currently rebooting all k8s worker nodes, which will take a couple
of hours to complete. During those reboots some jobs may fail or
experience surprise rescheduling.
Sorry for the outage! If your grid job was disrupted by this outage,
please take this as a sign to migrate your service off the grid! Details
about the grid shutdown can be found here:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
-Andrew (+ Taavi who did most of the actual recovery work)
Hello!
After our initial announcement of the Grid Engine shutdown timeline[0],
some of you raised concerns about losing your tools.
We want to address those apprehensions while hopefully providing
reassurance. No tools will be deleted until the grid engine shutdown date
on 14 February 2024. However, for tools with unreachable maintainers, an
outage will happen starting on 14 December 2023[1]. This is intended to
raise awareness for users or maintainers who have not otherwise been
reached. A list of these tools can be found here[2]. If you are a
maintainer or a user of a tool in this list, comment on the associated
Phabricator ticket with migration plans or a request for more support. The
goal is to have a plan for all tools running on the grid. We want all
actively used tools to be migrated, and will help support users of critical
tools without a maintainer. Thanks for your help in identifying and
migrating those tools you maintain and depend on.
We acknowledge that the timeline might seem tight, and we want to clarify
that our approach is to make this process as seamless as possible. We have
been actively engaging with tool maintainers over the past year, and we
genuinely appreciate the efforts many of you have already made to migrate
your tools to Kubernetes.
We will continue to work closely with maintainers who might need additional
time or assistance.
If for any reason you have not received a Phabricator ticket for your tool,
please reach out.
The Phabricator ticket is a good place to communicate your needs and plans
for any remaining tools or jobs.
This will help us further organize and plan this process.
Our primary goal is to support you through this transition. If you have
further concerns about the deadline or if you need assistance with the
migration process, please don't hesitate to reach out to us. We are
available on IRC, Telegram, Phabricator[3], and through our other support
channels[4].
Do you still have concerns or questions? Please let us know. We want to do
this together with you, in a way which makes sense to everyone. We’re very
grateful for all the hard work you do, and our only goal here is to secure
the future of tools in the Wikimedia sphere, not to make your lives more
difficult.
Thank you!
[0]:
https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.…
[1]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[2]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation/…
[3]: https://phabricator.wikimedia.org/project/board/6135/
[4]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
We are experiencing networking issues on Cloud VPS, which means
currently no traffic is getting in or out of Cloud VPS. Toolforge is
also down.
We are working on it and progress is tracked at
https://phabricator.wikimedia.org/T352539
We will send an update when things are working again, thanks for your patience.
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation
Later today, I am upgrading our OpenStack deployment from version Zed to
Antelope. [1]
Expect Cloud VPS to be partially unstable: horizon.wikimedia.org will show
a maintenance message and API calls might fail.
You can follow the upgrade details at
https://phabricator.wikimedia.org/T348843 and on IRC
(#wikimedia-cloud-admin).
[1] https://releases.openstack.org/antelope/
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation
Hello, all!
Starting today we are kicking off the process to shut down Grid Engine and
we want to share the timeline with you.
== Background ==
WMCS made the Grid Engine available as a backend engine for hosting tools
on Toolforge, our Platform as a Service (PaaS) offering.
An additional backend engine, Kubernetes, was also made available on
Toolforge.
Over time, maintaining and securing the grid has proven difficult, and it
has made it harder to support the community in other ways because many
hours of maintenance work are spent on it.
This is mainly because there have been no new Grid Engine
releases (bug fixes, security patches, or otherwise) since 2016.[0]
Maintenance work on the grid continued because it was widely popular with
the community and the Kubernetes offering didn't yet have many of the
grid-like features that contributors had come to love.
Once the Kubernetes platform could handle many of the workloads, we started
the grid deprecation process by asking maintainers to migrate off the
grid.[1]
Over the past year, we've been reaching out to our tool maintainers and
working with them to migrate their tools off the Grid to Kubernetes.
We have reached out directly to all maintainers with their Phabricator
ticket IDs.
The latest updates to Build Service[2] have addressed many of the issues
that prevented tool maintainers from migrating.
== Initial Timeline ==
The detailed grid shutdown timeline is available on wiki.[3] The important
dates have been copied below.
* 14th December, 2023: Any maintainer who has not responded on Phabricator
will have their tools shut down and crontabs commented out. Please plan to
migrate or tell us your plans on Phabricator before that date.
* 14th February, 2024: The grid is completely shut down. All tools are
stopped.
If you need further clarification or help migrating your tool, don't
hesitate to reach out to us on IRC, Telegram, Phabricator[4] or via any of
our support channels.[5]
Thank you.
[0]: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/
[1]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation
[2]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
[3]:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#…
[4]: https://phabricator.wikimedia.org/project/profile/6135/
[5]:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Commun…
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hello!
The 2022 Cloud Services survey results have been published!
We had 159 participants who responded and provided valuable feedback and
suggestions.
For the first time, we moved from Google Forms to using LimeSurvey.
Some of you have long requested this change and we will continue to use
LimeSurvey going forward.
Publication of the results was delayed, but they are finally here:
https://meta.wikimedia.org/wiki/Research:Cloud_Services_Annual_Survey/2022
Thanks to everyone who participated and provided input and comments!
We will launch the 2023 Cloud Services survey next month!
Thank you!
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hi,
Toolforge's Harbor instance will briefly be down for a version upgrade from
2.5 to 2.9 this Wednesday at 8:00 UTC.
https://phabricator.wikimedia.org/T346241
This should not affect any tools that are not using the new build service,
nor any tools that are already running.
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
If you are using the build service, you will not be able to run any new
builds, or start a job or a webservice from an image built with the
build service, while Harbor is down.
We will send an update before starting maintenance work, and once
everything is back up and running.
--
Slavina Stefanova (she/her)
Software Engineer | Developer Experience
Wikimedia Foundation
Hi!
There will be a small network interruption next Monday at around 13:00 UTC
as we will be doing some cleanup on OpenStack after the network
re-architecture (see https://phabricator.wikimedia.org/T348140).
It will affect all of Cloud VPS and the other services hosted there
(including Toolforge, PAWS, Quarry and Superset). VM traffic to the internet
will be cut for a short period, hopefully only a few seconds. Internal
traffic will not be affected, but if you have any open ssh session to your
VMs or login.toolforge.org, it might time out or get dropped, and any web
access to projects will not work during the downtime.
We will update on this email thread when we start and when we have finished.
This will help stabilize the network and avoid bigger outages in the
future, so thanks for your patience!
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Hello,
The Toolforge admin team is happy to announce that the Toolforge Build
Service[0] is now available in open beta.
The Build Service is intended to allow more tools to migrate off the Grid
Engine and to make the process for deploying code to
Toolforge easier and more flexible, by building container images with the
specific dependencies for each tool.
Here are quick highlights of some of the current key features:
1. Build your tool from source code, using your language's dependency
management tool: no Dockerfiles, no scripts, no manual steps
2. Use industry-wide standards[1]: no vendor lock-in, thanks to upstream
buildpacks
3. Support for many languages out of the box[2]
4. Envvars - Create and manage environment variables and secrets that are
available at runtime.[3]
5. Ability to install packages from the Ubuntu repositories[4]
6. Improved resiliency and resource usage by allowing NFS-less
webservices[5], if you don't need NFS
7. Test your image locally, or anywhere[6]
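
As a minimal sketch of item 4, assuming the values set through Envvars
reach the tool as ordinary process environment variables, a Python tool
could read one as shown here. The helper read_setting and the variable name
MY_TOOL_API_TOKEN are hypothetical and only illustrate the idea; they are
not part of the Build Service or the Envvars service.

```python
import os


def read_setting(name, default=None, environ=os.environ):
    """Return the value of an environment variable, or `default` if it is
    unset or empty. `environ` can be swapped for a plain dict in tests."""
    value = environ.get(name)
    if value is None or value == "":
        return default
    return value


# MY_TOOL_API_TOKEN is a hypothetical variable name used for illustration.
token = read_setting("MY_TOOL_API_TOKEN", default="")
```

Reading secrets this way keeps them out of the source repository, so the
same image can be tested locally with different values.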
Please review the current known limitations here[7]
We also have a growing list of tutorials for various languages[8]
During this open beta, we invite you to actively participate and share your
feedback by replying to this thread or through IRC. If you
find any issues or have any feature suggestions, you can use this task
template[9].
Your insights will help us enhance and tailor the Build Service to meet the
needs of your tools.
The plan is to have this phase run for the next few months and, if no big
issues are found, promote it to global availability phase 1 (GA1)
while we work on adding automatic triggering and deployment, for which we
will do a second round of beta testing for those specific features.
This unblocks the last step to migrate out of the grid, so we ask all
grid users to give it a try and report any issues they might find.
No big changes are expected for the currently implemented features, so
any work done now will help later.
Thank you for being a part of this journey. We look forward to your
invaluable feedback and collaboration as we strive to provide a better
developer experience.
[0]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
[1]: https://buildpacks.io/
[2]:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Supported_…
[3]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Envvars_Service
[4]:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Installing…
[5]:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Using_NFS_…
[6]:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Testing_lo…
[7]:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Known_curr…
[8]:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Tutorials_…
[9]: https://w.wiki/7kpi
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services