Hi all,
I do appreciate the efforts to keep Toolforge running, and that sometimes
massive changes are necessary to do this, which has implications for tool
maintainers.
I also understand that there have to be deadlines at some point, otherwise
things will never get finished.
But as I have said on Phabricator (can't find the ticket now), I have been
active in moving things to k8s from early on; I have literally rewritten
enormous codebases (eg Mix'n'match) in a different language, because the
k8s approach does not support the way I did things with grid engine. And
while I think the new code is an improvement over the old one, it has taken
a huge amount of my time to do this, with little visible improvement for
the end user.
K8s, as it's run right now on Toolforge, has several problems:
- It cannot run fire-and-forget jobs, because every job needs a name that
you may or may not re-use.
- It has very limited per-tool resources, and the webservice reduces those
even further.
- It cannot temporarily scale up. Eg I need to process a lot of data once;
on grid engine, I could just fire off all the jobs, wait for them to
complete, re-run the failed ones etc. This is simply not possible on k8s as
it is.
- Even the current Wikitech documentation still uses grid engine, eg
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Rust (I have tried, and
failed, to get that running on k8s).
I know there is a technical reason to limit per-tool k8s resources so much
(something about running on a single VM), but IMHO there needs to be a lot
more flexibility; give the user the option to scale up tool resources
without having to go through Phab bureaucracy, run jobs on a large (shared)
k8s pool, auto-generate job names for fire-and-forget jobs, something.
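To illustrate the auto-generated job names idea, here is a minimal sketch,
not anything Toolforge provides today. The function name make_job_name and
the timestamp-plus-random-suffix scheme are my own assumptions; the only
hard constraint it relies on is the Kubernetes DNS-1123 label rule
(lowercase alphanumerics and '-', at most 63 characters, starting and
ending with an alphanumeric).

```python
import re
import secrets
import time

MAX_LABEL_LEN = 63  # Kubernetes DNS-1123 label length limit


def make_job_name(prefix: str) -> str:
    """Return a unique, DNS-1123-compatible name derived from prefix.

    Hypothetical helper: the naming scheme is an assumption, not an
    existing Toolforge feature.
    """
    # Lowercase, then collapse anything outside [a-z0-9-] into '-'.
    base = re.sub(r"[^a-z0-9-]+", "-", prefix.lower()).strip("-") or "job"
    # Timestamp keeps names sortable; random hex avoids collisions.
    suffix = f"{int(time.time())}-{secrets.token_hex(3)}"
    # Truncate the base so base + '-' + suffix fits in the limit.
    room = MAX_LABEL_LEN - len(suffix) - 1
    base = base[:room].rstrip("-")
    return f"{base}-{suffix}"


if __name__ == "__main__":
    print(make_job_name("Mix'n'match Import"))
```

With something like this, a fire-and-forget submission would never need
the maintainer to invent or track a name; the generated name could be
passed wherever the job framework expects one.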
As for the deadline(s) given here, as I stated above, I started quite early
on this, and invested a lot of work. Yet, I still have tools listed on
https://grid-deprecation.toolforge.org/ (which was not linked from the
original mail, despite being the main link people need IMHO), so I do feel
the pressure myself. Maybe you could disable grid engine for all tools NOT
on that page, to ensure no one restarts with grid engine, and leave a
smaller pool running for the remaining tools, to make resources available
for k8s while giving the remaining tool users a bit more time?
Apologies for the long rant,
Magnus
On Tue, Dec 5, 2023 at 7:13 PM <meta.sj(a)gmail.com> wrote:
Hello, this is a provocative approach to migration!
* Clarification Q: The timeline was finalized on November 28, and every
tool and cronjob with no response from its maintainers will stop working on
the same day, December 14th, regardless of how actively it is used? Is
there anything that users of those tools can do to delay this? It might be
worth posting in places where active /users/ of tools hang out, not just
the maintainers, as they will be inconvenienced and may be able to share
maintainership where needed.
* Can you share stats on how many tools remain to be migrated, how many
will stop in December, and which are the most-used? This phab board has
~460 open tasks
https://phabricator.wikimedia.org/project/board/6135/query/open/ ,
some created this week by the maintainers after receiving a recent ping
https://phabricator.wikimedia.org/T352564 , while
https://grid-deprecation.toolforge.org/ lists only 447 tools still
running on GE -- scores of which seem quite popular.
* When "grid infrastructure is deleted" on March 14, will there be backups
of the tools for people who want to migrate them in the future?
* At least Maarten and Albin asked to be unassigned from migration tasks
for their tools (but remain assigned). If they can't unassign themselves,
and users need to coordinate finding migrators for their tools to keep
working: is there some other way to flag in Phab which tools need someone
to work on migration? Ideally a way visible from taskboard overviews...
Cordially, Sam