Hello, all!
This email contains valuable information about the Toolforge service.
Starting today, we're initiating a process to migrate all Toolforge servers from Debian Stretch to Debian Buster; the most affected piece is the Grid Engine backend.
Debian Stretch was released in June 2017, and long-term support for it (including security updates) will cease in June 2022. We need to shut down all Stretch hosts before that end-of-support date to ensure that Toolforge remains a secure platform. This migration will take several months, because many people still use the Stretch hosts and our users work on their tools in their spare time.
You should be aware that our ultimate goal is to deprecate Grid Engine entirely and replace it with Kubernetes. Read below for more information on this.
== Initial timeline ==
Subject to change; see Wikitech[1] for the living timeline.
* 2022-02-15: Availability of Debian Buster grid announced to community
* 2022-03-21: Weekly reminders via email to tool maintainers for tools still running on Stretch
* Week of 2022-04-21:
** Daily reminders via email to tool maintainers for tools still running on Stretch
** Switch login.toolforge.org to point to Buster bastion
* Week of 2022-05-02: Evaluate migration status and formulate plan for final shutdown of Stretch grid
* Week of 2022-05-21: Shut down Stretch grid
== What is changing? ==
* New bastion hosts running Debian Buster with connectivity to the new job grid
* New versions of PHP, Python3, and other language runtimes
* New versions of various support libraries
== What should I do? ==
You should migrate your Toolforge tool to a newer environment. You have two options:
* migrate from Toolforge Stretch Grid Engine to Toolforge Kubernetes[3].
* migrate from Toolforge Stretch Grid Engine to Toolforge Buster Grid Engine.
The Cloud Services team has created the Toolforge Stretch deprecation[0] page on wikitech.wikimedia.org to document the basic steps needed to move web services, cron jobs, and continuous jobs from the old Stretch grid to the new Buster grid. That page also provides more details on the language runtime and library version changes, and will collect answers to common problems as we find them. If the answer to your problem isn't on the wiki, ask for help using any of our communication channels[2].
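For a typical web service, the move looks roughly like this (the bastion host name and webservice type below are only examples; the deprecation page[0] is the authoritative reference):

  # log in via the Buster bastion (host name is an example) and switch to your tool account
  ssh you@login-buster.toolforge.org
  become mytool
  # stop the service on the old grid and start it again so it is scheduled on the Buster grid
  webservice stop
  webservice --backend=gridengine php7.3 start

Cron entries and continuous jobs have analogous steps; the wiki page covers those cases in detail.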
We encourage you to move to Kubernetes today if you can; see below for more details.
If you can't migrate to Kubernetes, please move to the Debian Buster grid within the next three months.
== A note on the future of Toolforge, the Grid and Kubernetes ==
As of today, Toolforge is powered by both Grid Engine and Kubernetes. For a number of reasons, we have decided to deprecate Grid Engine and replace all of its functions with Kubernetes. We're not yet ready to offer all grid-like features on Kubernetes, but we're working on it. As soon as we are able, we will begin the process of migrating the workloads and shutting down the grid. This is something we hope to do between 2022 and 2023.
We share this information to encourage you to evaluate migrating your tool away from Grid Engine to Kubernetes.
One of the most prominent missing features on Kubernetes was a friendly command line interface to schedule jobs (like jsub). We've been working on that, and have a beta-level interface that you can try today: the Toolforge jobs framework [4].
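Scheduling a script with it looks roughly like this (the job name, script, and schedule are placeholders; see the framework docs[4] for the full and current option list):

  # create a job named "daily-cleanup" that runs cleanup.sh every day at 03:00
  toolforge-jobs run daily-cleanup --command ./cleanup.sh --image tf-bullseye-std --schedule "0 3 * * *"
  # list the jobs currently defined for your tool
  toolforge-jobs list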
[0]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation
[1]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Timel...
[2]: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Communi...
[3]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
[4]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs-Framework
Thanks.
Hi,
Why are we upgrading to Buster instead of Bullseye? According to https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_policy Buster will be end of life around August this year. So we're either stuck with an older version for a while or we have to do this whole exercise again much sooner than we would like. Can you explain?
Maarten
Hi there,
Legit question. I'm happy to elaborate:
* this was all discussed back in September 2021 in Phabricator; see https://phabricator.wikimedia.org/T277653#7378774 and https://phabricator.wikimedia.org/T277653#7381146. Our conclusion was not to skip Buster.
* we are hoping that there won't be a Buster->Bullseye migration for the grid. Hopefully, by the time we need to retire Buster, the Kubernetes backend will be a fully suitable solution for every tool.
* this migration work started before Debian Bullseye was released; our intention was to complete it before that release, but for a couple of reasons the project was delayed.
* in the grid case, the engineering effort to do an N+1 upgrade is lower than that of an N+2 upgrade. If we had tried an N+2 upgrade directly, things would have been much slower and more difficult for us.
Your concern about doing the migration dance twice is 100% valid; the only way to future-proof your tool is to remove its dependency on Grid Engine and migrate it to the Kubernetes backend.
regards.
From my perspective as a Toolforge user, one of the issues I see is that it's often not clear how to map the "friendly command line interface" onto concepts I already understand from the lower-level tools.
For example, the webservice script does some useful stuff. But, it wasn't clear exactly what it was doing, i.e. there was a lot of magic happening. While the magic is certainly an integral part of hiding the low-level details, it also obfuscates things. Reading the webservice script wasn't much help; it's long and complicated, and mixes grid and k8s functionality in a way that further hides what's actually going on.
Anyway, all I'm really asking is that as the docs get written for the "friendly command line interface", you also include some explanation of what's happening behind the scenes. For example, maybe add a --verbose option to all the tools that makes them print the back-end commands they're executing, so
webservice --backend=kubernetes python3.7 restart
might print:
kubectl exec -i -t shell-1645020371 --container main-app -- /bin/bash
And then somebody who already understands kubectl would instantly understand what's happening. It's not hard to guess the basic gist of what it must be doing, but having the details confessed eliminates any doubt, enhancing comprehension.
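In the meantime, the curious can already poke at the underlying objects by hand. Assuming (as I believe, but haven't verified) that webservice creates a Deployment and Service named after the tool, something like this shows most of what it set up:

  kubectl get deployments,pods,services
  kubectl describe deployment mytool    # "mytool" is a placeholder for the tool name
  kubectl logs <pod-name>

Having the tools themselves print this kind of thing would just make the mapping explicit.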
As another example, it took me a little bit to figure out that the "become" command doesn't do anything more magic than run sudo with a little sanity checking wrapped around it. Fortunately, that script is simple enough that once I looked at it, it was obvious what it was doing. But other parts of the "friendly command line interface" are rather more opaque.
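(For the record, my reading of the become script is that it boils down to roughly the following, minus the sanity checks, with "mytool" as a placeholder:

  sudo -i -u tools.mytool

i.e. start a login shell as the tool's service user.)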
On Feb 15, 2022, at 11:42 AM, Seyram Komla Sapaty ssapaty@wikimedia.org wrote:
One of the most prominent missing features on Kubernetes was a friendly command line interface to schedule jobs (like jsub). We've been working on that, and have a beta-level interface that you can try today: the Toolforge jobs framework [4].
I can see already that trying to migrate to this new toolforge-jobs framework is going to be a long and winding road.
First of all, the documentation on "choosing the execution runtime" leaves much to be desired. There is a list of 23 available runtimes to choose from, but no guidance on how to choose, except what one can guess from the names.
From my initial attempts, I have determined that the default "tf-bullseye-std" runtime does not have python3 available, so it is not suitable for running python-based tools (including Pywikibot). The "tf-python39" runtime does not have php available. But I have some jobs that run _both_ python3 and php scripts. Is there any runtime that will accommodate this?
Also, it is not possible to load Pywikibot in the tf-python39 runtime because a required module (requests, from https://python-requests.org) is not available. What is the process for requesting (no pun intended) that this (or any other resource) be added to the image?
TIA
On 2/16/22 17:34, Russell Blau wrote:
Also, it is not possible to load Pywikibot in the tf-python39 runtime because a required module (requests, from https://python-requests.org) is not available. What is the process for requesting (no pun intended) that this (or any other resource) be added to the image?
See some documentation here:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Python#Kubernetes_python_...
I just created it, and it may need some polishing, but it should work!
We will review pywikibot specific workflows and documents soon.
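In short, the general idea is roughly the following (paths, image and script names here are just an illustration; the wiki page is the source of truth):

  # one-off job that builds a virtualenv on the tool's home (NFS), using the same image the real jobs will use
  # bootstrap-venv.sh would contain something like:
  #   python3 -m venv $HOME/pyvenv
  #   $HOME/pyvenv/bin/pip install requests
  toolforge-jobs run bootstrap-venv --image tf-python39 --wait --command ./bootstrap-venv.sh
  # later jobs then use the interpreter from that virtualenv
  toolforge-jobs run my-bot --image tf-python39 --command "$HOME/pyvenv/bin/python3 my_bot.py"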
regards.
I recently migrated most of my pywikibot jobs from grid engine to k8s and, to my surprise, it was actually quite easy. I have lots of tasks (I've been running bots since 2008), so many that the move showed a reduction in the total number of jobs in SGE.
One thing that helped me was collapsing everything into a bash file, like hourly.sh, and making that a single job (it's better to have one pod doing a batch of work, as the overhead of creating a pod is bigger than that of an SGE job).
Also, make sure your bash file is executable (see the docs: chmod ug+x hourly.sh); that would have saved me a bit of time.
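For illustration, the shape of it is something like this (the file contents, paths, image and schedule here are made up):

  #!/bin/bash
  # hourly.sh: batch several bot runs into a single job, so one pod does all the work
  set -e
  cd "$HOME/bot"
  "$HOME/pyvenv/bin/python3" pwb.py task_one
  "$HOME/pyvenv/bin/python3" pwb.py task_two

and then schedule it once per hour with the jobs framework:

  chmod ug+x hourly.sh
  toolforge-jobs run hourly --command ./hourly.sh --image tf-python39 --schedule "0 * * * *"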
Thank you for doing this. I use only the new fancy infrastructure from now on ^^
-- Amir (he/him)
I too use bash files (see https://github.com/PersianWikipedia/fawikibot/tree/master/HujiBot/grid/jobs) and am planning to move them from grid to k8s in the next few weeks. I have found the process well-documented and am hoping for the transition to be easy, like Amir said.
Since at least two people are using this approach, once I am done with my transition I might go to Wikitech and add some helpful hints for other users.