Hi all,
I do appreciate the efforts to keep Toolforge running, and I understand that sometimes massive changes are necessary to do this, which has implications for tool maintainers.
I also understand that there have to be deadlines at some point, otherwise things will never get finished.
But as I have said on Phabricator (I can't find the ticket now), I have been active in moving things to k8s from early on. I have literally rewritten enormous codebases (e.g. Mix'n'match) in a different language, because the k8s approach does not support the way I did things with grid engine. And while I think the new code is an improvement over the old, it has taken a huge amount of my time, with little visible improvement for the end user.
K8s, as it is currently run on Toolforge:
- cannot use fire-and-forget jobs, because every job needs a name that you may or may not be able to re-use
- has very limited per-tool resources, and the webservice reduces those even further
- cannot temporarily scale up. E.g. I need to process a lot of data once; on grid engine, I could just fire off all the jobs, wait for them to complete, re-run the failed ones, etc. This is simply not possible on k8s as it stands.
I know there is a technical reason to limit per-tool k8s resources so much (something about running on a single VM), but IMHO there needs to be a lot more flexibility: give users the option to scale up tool resources without having to go through Phab bureaucracy, run jobs on a large (shared) k8s pool, auto-generate job names for fire-and-forget jobs, or something along those lines.
As for the deadline(s) given here: as I stated above, I started on this quite early and invested a lot of work. Yet I still have tools listed on https://grid-deprecation.toolforge.org/ (which was not linked from the original mail, despite being, IMHO, the main link people need), so I do feel the pressure myself. Maybe you could disable grid engine for all tools NOT on that page, to ensure no one restarts anything on grid engine, and leave a smaller pool running for the remaining tools? That would free up resources for k8s while giving the remaining tool maintainers a bit more time.
Apologies for the long rant,
Magnus