Hi all,
With MediaWiki at the WMF moving to Kubernetes, it's now time to start running manual maintenance scripts there. Any time you would previously SSH to a mwmaint host and run mwscript, follow these steps instead. The old way will continue working for a little while, but it will be going away.
What's familiar:
Starting a maintenance script looks like this:
rzl@deploy2002:~$ mwscript-k8s --comment="T341553" -- Version.php --wiki=enwiki
Any options for the mwscript-k8s tool, as described below, go before the --.
After the --, the first argument is the script name; everything else is passed to the script. This is the same as you're used to passing to mwscript.
What's different:
- Run mwscript-k8s on a deployment host, not the maintenance host. Either deployment host will work; your job will automatically run in whichever data center is active, so you no longer need to change hosts when there’s a switchover.
- You don't need a tmux. By default the tool launches your maintenance script and exits immediately, without waiting for your job to finish. If you log out of the deployment host, your job keeps running on the Kubernetes cluster.
- Kubernetes saves the maintenance script's output for seven days after completion. By default, mwscript-k8s prints a kubectl command that you (or anyone else) can paste and run to monitor the output or save it to a file.
- As a convenience, you can pass -f (--follow) to mwscript-k8s to immediately begin tailing the script output. If you like, you can do this inside a tmux and keep the same workflow as before. Either way, you can safely disconnect and your script will continue running on Kubernetes.
rzl@deploy2002:~$ mwscript-k8s -f -- Version.php --wiki=testwiki
[...]
MediaWiki version: 1.43.0-wmf.24 LTS (built: 22:35, 23 September 2024)
- For scripts that take input on stdin, you can pass --attach to mwscript-k8s, either interactively or in a pipeline.
rzl@deploy2002:~$ mwscript-k8s --attach -- shell.php --wiki=testwiki
[...]
Psy Shell v0.12.3 (PHP 7.4.33 — cli) by Justin Hileman
$wmgRealm
= "production"
rzl@deploy2002:~$ cat example_url.txt | mwscript-k8s --attach -- purgeList.php
[...]
Purging 1 urls
Done!
- Your maintenance script runs in a Docker container which will not outlive it, so it can't save persistent files to disk. Ensure your script logs its important output to stdout, or persists it in a database or other remote storage.
- The --comment flag sets an optional (but encouraged) descriptive label, such as a task number.
- Using standard kubectl commands[1][2], you can check the status, and view the output, of your running jobs or anyone else's. (Example: `kube_env mw-script codfw; kubectl get pod -l username=rzl`)
[1]: https://wikitech.wikimedia.org/wiki/Kubernetes/Kubectl
[2]: https://kubernetes.io/docs/reference/kubectl/quick-reference/
What's not supported yet:
- Maintenance scripts launched automatically on a timer. We're working on migrating them -- for now, this is for one-off scripts launched by hand.
- If your job is interrupted (e.g. by hardware problems), Kubernetes can automatically move it to another machine and restart it, babysitting it until it completes. But we only want to do that if your job is safe to restart. So by default, if your job is interrupted, it will stay stopped until you restart it yourself. Soon, we'll add an option to declare "this is idempotent, please restart it as needed" and that design is recommended for new scripts.
- No support yet for mwscriptwikiset, foreachwiki, foreachwikiindblist, etc., but we'll add similar functionality as flags to mwscript-k8s.
Your feedback:
Let me know by email or IRC, or on Phab (T341553 https://phabricator.wikimedia.org/T341553). If mwscript-k8s doesn't work for you, for now you can fall back to using the mwmaint hosts as before -- but they will be going away. Please report any problems sooner rather than later, so that we can ensure the new system meets your needs before that happens.
Thanks,
Reuven, for Service Ops SRE
Thanks Reuven! Two questions:
- Is there a Wikitech page with this information? (I did not find one)
- Is there a mwscript-k8s equivalent to Ctrl-c with the old style maintenance script runner, if you need to stop the script?
Kosta
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Both good questions, thank you!
On Fri, Sep 27, 2024 at 4:45 AM Kosta Harlan kharlan@wikimedia.org wrote:
> Thanks Reuven! Two questions:
> - Is there a Wikitech page with this information? (I did not find one)
I expect to have Wikitech updated today -- finishing up the text now, sorry about the delay.
> - Is there a mwscript-k8s equivalent to Ctrl-c with the old style maintenance script runner, if you need to stop the script?
Once the job is launched with mwscript-k8s, you can use Kubernetes's standard kubectl commands to interact with it. In this case, delete the job, which sends a SIGTERM to the running script.
$ kube_env mw-script-deploy codfw  # Act as the deploy user to get delete privileges; use caution
$ kubectl get job -l username=kharla  # Look up the job name, if you don't have it handy
$ kubectl delete job ${JOB_NAME}
mwscript-k8s prints out the job name when it starts -- it'll look like "mw-script.codfw.1234wxyz" with a random alphanumeric component at the end.
Note that deleting the job will not only terminate it, but, well, delete it. That includes deleting its saved logs -- so capture those first if you need to keep them.
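The order of operations described above (capture the logs first, then delete the job) could be wrapped in a small shell function. This is only a minimal sketch and not part of mwscript-k8s: the function name stop_job, the KUBECTL override variable, and the default log file name are assumptions made for illustration. It also assumes that kube_env mw-script-deploy has already been run in the current shell.

```shell
# Hypothetical helper, not part of mwscript-k8s.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

stop_job() {
    job_name="$1"
    log_file="${2:-${job_name}.log}"
    if [ -z "$job_name" ]; then
        echo "usage: stop_job JOB_NAME [LOG_FILE]" >&2
        return 2
    fi
    # Save the output first: deleting the job also deletes its saved logs.
    "$KUBECTL" logs "job/${job_name}" > "$log_file" || return 1
    # Deleting the job sends SIGTERM to the running script.
    "$KUBECTL" delete job "$job_name"
}
```

Calling stop_job with the job name printed by mwscript-k8s would write the logs to a file named after the job and then delete the job. If saving the logs fails, the function returns before deleting anything.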
On Fri, Sep 27, 2024 at 9:22 AM Reuven Lazarus rlazarus@wikimedia.org wrote:
> On Fri, Sep 27, 2024 at 4:45 AM Kosta Harlan kharlan@wikimedia.org wrote:
>> - Is there a Wikitech page with this information? (I did not find one)
> I expect to have Wikitech updated today -- finishing up the text now, sorry about the delay.
A lightly-reformatted version of the same information is now posted at https://wikitech.wikimedia.org/wiki/Maintenance_scripts (formerly a redirect to Maintenance_server). Thanks for your patience.
Hi,
Thanks for the work on this.
I have the occasional need to run a maintenance script manually as a one-off for a set amount of time. With "mwscript", I used the "timeout" command to exit the script after a specified amount of time. However, I cannot see an easy way to achieve this with mwscript-k8s, because the mwscript-k8s command exiting doesn't stop the execution of the maintenance script.
Thanks,
Dreamy Jazz / WBrown (WMF)
On Mon, Sep 30, 2024 at 5:55 AM Dreamy Jazz dreamyjazzwikipedia@gmail.com wrote:
> I have the occasional need to run a maintenance script manually one-off for a set amount of time. I used the "timeout" command with "mwscript" to exit the script after a specified amount of time. However, I cannot see an easy way to achieve this, as the mwscript-k8s command exiting doesn't stop the execution of the maintenance script.
Hi Dreamy,
I don't think this would be too hard to implement.
I think we could plumb through a new CLI argument for mwscript-k8s that, if set, would set the .spec.activeDeadlineSeconds field on the k8s Job object. It would be a small patch to the Helm chart and to the Python script.
The activeDeadlineSeconds field has the following behavior: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-terminati...
Can you confirm this sounds good to you?
Thanks!
--
Chris Danis (they/them)
Staff Site Reliability Engineer
Wikimedia Foundation
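For readers unfamiliar with the field, here is a minimal sketch of where .spec.activeDeadlineSeconds sits in a Kubernetes Job. It is a generic illustration, not the mw-script Helm chart: the function name, the container name, the busybox image, and the sleep command are all placeholders. Once the deadline passes, Kubernetes terminates the Job's running Pods and marks the Job as Failed with reason DeadlineExceeded.

```shell
# Hypothetical illustration: print a minimal Job manifest with a deadline.
# The image and command are placeholders, not the real mw-script chart.
job_manifest_with_deadline() {
    job_name="$1"
    deadline_seconds="$2"
    # Require a non-empty, all-digit deadline.
    case "$deadline_seconds" in
        ''|*[!0-9]*)
            echo "usage: job_manifest_with_deadline NAME SECONDS" >&2
            return 2
            ;;
    esac
    cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ${job_name}
spec:
  activeDeadlineSeconds: ${deadline_seconds}
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: example
          image: busybox
          command: ["sleep", "3600"]
EOF
}
```

On a test cluster, the output could be piped to kubectl apply -f - to watch a job being cut off at the deadline.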
Thanks for the quick reply.
Using activeDeadlineSeconds sounds good to me.
One thing I would note is that the job would be marked as having failed, though I think the distinction between complete and failed would not be significant in that context.
Dreamy Jazz / WBrown (WMF)
Good feedback, thanks both! Filed as https://phabricator.wikimedia.org/T376099.
I'm starting some batch maintenance of video transcodes so I'm exercising the new k8s-based maint script system on TMH's requeueTranscodes.php; good news: no surprises so far, everything's working just fine. :D
Since I'm running the same scripts over multiple wikis, I went ahead and manually wrapped them in a bash for loop that submits one job at a time out of all.dblist. I'm using a screen session for the wrapper loop and tailing the logs to the session so they don't all smash out at once, plus a second manually-started run for Commons. :)
First-class support for running over a dblist will be a very welcome improvement, and should be pretty straightforward! Good work everybody. :D
The longest job (Commons) might take a couple days to run, so we'll see if anything explodes later! hehe
-- brooke
Great to hear, thanks!
As a side note for others, to highlight something Brooke said in passing: Wrapping mwscript-k8s in a bash for loop is a fine idea, *as long as* you're running it with --follow or --attach. In that case, each mwscript-k8s invocation will keep running to monitor the job's output, and will terminate when the job terminates. One job will run at a time, which is what you expect.
Without --follow or --attach, mwscript-k8s is just the launcher: it kicks off your job, then terminates immediately. Your for loop will rapidfire *launch* all the jobs one after another, which means hundreds of 'em might be executing simultaneously, and that might not be what you had in mind. If your job involves expensive DB operations, it *really* might not be what you had in mind.
First-class dblist support will indeed make that pitfall easier to avoid. In the meantime there's nothing wrong with using a for loop, and it's what I'd do too -- but since this is a new system and nobody has well-honed intuition for it yet, I wanted to draw everyone's eye to that distinction.
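To make the distinction concrete, here is a minimal sketch of the sequential pattern. The function name run_sequentially and the LAUNCH variable are assumptions for illustration; by default the sketch only prints the commands instead of running them. Note that a later message in this thread asks that mwscript-k8s not be run in loops for now, so treat this purely as an illustration of the --follow behavior.

```shell
# Hypothetical sketch. LAUNCH defaults to "echo mwscript-k8s", so this only
# prints the commands; setting LAUNCH=mwscript-k8s would really launch jobs.
LAUNCH="${LAUNCH:-echo mwscript-k8s}"

run_sequentially() {
    script="$1"
    shift
    for wiki in "$@"; do
        # With --follow, each invocation keeps running until its job
        # terminates, so the next iteration does not start before then.
        # Stop the loop if the launcher itself reports a failure.
        $LAUNCH --follow -- "$script" --wiki="$wiki" || return 1
    done
}
```

For example, run_sequentially Version.php testwiki test2wiki prints one mwscript-k8s command per wiki. Dropping --follow from the command would turn the same loop into the rapid-fire launcher described above.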
Hi,
Well, there was one surprise, and it did explode! It was probably very close to a full outage in codfw. We are still documenting everything and will publish a full incident report, but the TL;DR is this: every mwscript-k8s invocation in the for loop creates an entire Helm release, including all of its k8s resources. This is by design, but it had an unforeseen consequence: close to 2k Calico Network Policies were created (along with a ton of other resources, which would create their own set of problems). All Calico components in k8s had to react gradually to the increasing number of policies, and some of them ended up hitting their resource limits. That led to throttling, then to failures, and finally to a slowly cascading outage that took hardware nodes out of rotation one by one. The last couple of hours were interesting for some of us, I can tell you that.
We are already working on plans to fix this (we have enough action items already, including amending the design). In the meantime, until we post an update saying this is solved, please don't spawn mwscript-k8s in a for loop or anything similar. You can of course continue working with the tool to get acquainted with it, find bugs, etc.; just please don't spawn hundreds or, even worse, thousands of invocations.
Brooke, I had to kill your bash shell on deploy2002 that was doing the transcodes. I am sorry about that, but despite attaching to your screen I couldn't find a way to stop it (it didn't respond to any of the usual control sequences or shell job controls), and I didn't want to risk one more outage (which would probably have happened once the resources reached some critical number).
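One way to keep an eye on how many jobs you have outstanding is to count them by the username label used earlier in this thread. This is a minimal sketch under assumptions: count_my_jobs and the KUBECTL override are hypothetical names, kube_env is assumed to have been run already, and the job count is only a rough proxy for the number of Helm releases and network policies involved.

```shell
# Hypothetical helper, not part of mwscript-k8s.
KUBECTL="${KUBECTL:-kubectl}"

count_my_jobs() {
    # Default to the current shell user if no username is given.
    user="${1:-$(id -un)}"
    # One output line per job; kubectl's "No resources found" message goes
    # to stderr, so an empty list counts as 0.
    "$KUBECTL" get job -l "username=${user}" --no-headers 2>/dev/null \
        | wc -l | tr -d ' '
}
```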
On Wed, Oct 9, 2024 at 6:16 AM Reuven Lazarus rlazarus@wikimedia.org wrote:
Great to hear, thanks!
As a side note for others, to highlight something Brooke said in passing: Wrapping mwscript-k8s in a bash for loop is a fine idea, *as long as* you're running it with --follow or --attach. In that case, each mwscript-k8s invocation will keep running to monitor the job's output, and will terminate when the job terminates. One job will run at a time, which is what you expect.
Without --follow or --attach, mwscript-k8s is just the launcher: it kicks off your job, then terminates immediately. Your for loop will rapidfire *launch* all the jobs one after another, which means hundreds of 'em might be executing simultaneously, and that might not be what you had in mind. If your job involves expensive DB operations, it *really* might not be what you had in mind.
First-class dblist support will indeed make that pitfall easier to avoid. In the meantime there's nothing wrong with using a for loop, and it's what I'd do too -- but since this is a new system and nobody has well-honed intuition for it yet, I wanted to draw everyone's eye to that distinction.
On Tue, Oct 8, 2024 at 4:37 PM Brooke Vibber bvibber@wikimedia.org wrote:
I'm starting some batch maintenance of video transcodes so I'm exercising the new k8s-based maint script system on TMH's requeueTranscodes.php; good news: no surprises so far, everything's working just fine. :D
Since I'm running the same scripts over multiple wikis I went ahead and manually wrapped them in a bash for loop so it's submitting one job at a time out of all.dblist, using a screen session for the wrapper loop and tailing the logs to the session so they don't all smash out at once, and a second manually-started run for Commons. :)
First-class support for running over a dblist will be a very welcome improvement, and should be pretty straightforward! Good work everybody. :D
The longest job (Commons) might take a couple days to run, so we'll see if anything explodes later! hehe
-- brooke
-- Reuven Lazarus (he/him), Staff Site Reliability Engineer, Wikimedia Foundation https://wikimediafoundation.org/
Yikes! And now... We know (knowing is half the battle)
-- brooke
On Wed, Oct 9, 2024, 12:12 AM Alexandros Kosiaris akosiaris@wikimedia.org wrote:
Hi,
Well, there was one surprise, and it did explode! It was probably very close to a full outage in codfw. We are still documenting everything and we'll be publishing a full incident report, but the TL;DR is that every single mwscript-k8s invocation in the for loop creates an entire helm release, including all of its k8s resources. This is by design, but it had an unforeseen consequence: close to 2k Calico Network Policies were created (along with a ton of other resources, which would create their own set of problems). All Calico components in k8s had to gradually react to the increasing number of policies, and some of them ended up hitting their resource limits, which led to throttling, then to failures, and then to a slowly cascading outage that was taking hardware nodes out of rotation one by one. The last couple of hours were interesting for some of us, I can tell you that.
We are already working on plans to fix this (we have enough action items already, including amending the design), but in the meantime, I'd like to request that, until we provide an update that we've solved this, you don't spawn mwscript-k8s in a for loop or anything similar. You can of course continue working with the tool to get acquainted with it, find bugs, etc.; just please don't spawn hundreds or, even worse, thousands of invocations.
Brooke, I had to kill your bash shell on deploy2002 doing the transcodes. I am sorry about that, but despite attaching to your screen I didn't manage to find how to stop it (it didn't respond to any of the usual control sequences and shell job controls) and I didn't want to risk 1 more outage (which would probably happen once the resources were reaching some critical number).
On Wed, Oct 9, 2024 at 6:16 AM Reuven Lazarus rlazarus@wikimedia.org wrote:
Great to hear, thanks!
As a side note for others, to highlight something Brooke said in passing: Wrapping mwscript-k8s in a bash for loop is a fine idea, *as long as* you're running it with --follow or --attach. In that case, each mwscript-k8s invocation will keep running to monitor the job's output, and will terminate when the job terminates. One job will run at a time, which is what you expect.
Without --follow or --attach, mwscript-k8s is just the launcher: it kicks off your job, then terminates immediately. Your for loop will rapidfire *launch* all the jobs one after another, which means hundreds of 'em might be executing simultaneously, and that might not be what you had in mind. If your job involves expensive DB operations, it *really* might not be what you had in mind.
First-class dblist support will indeed make that pitfall easier to avoid. In the meantime there's nothing wrong with using a for loop, and it's what I'd do too -- but since this is a new system and nobody has well-honed intuition for it yet, I wanted to draw everyone's eye to that distinction.
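To make that distinction concrete, here is a minimal Python sketch of the one-job-at-a-time pattern. It is an illustration only and not part of mwscript-k8s: the helper names `build_command` and `run_sequentially` are hypothetical, the script name is borrowed from Brooke's message, and it assumes (as described above) that `--follow` keeps the launcher process alive until the job terminates. Given the request elsewhere in this thread not to spawn hundreds of invocations, read it as a description of the control flow rather than something to run over a whole dblist.

```python
import subprocess


def build_command(wiki, script="requeueTranscodes.php"):
    """Build the argv for one mwscript-k8s invocation that follows its job."""
    # Wrapper options go before the "--"; the script and its options go after.
    return ["mwscript-k8s", "--follow", "--", script, f"--wiki={wiki}"]


def run_sequentially(wikis, runner=subprocess.run):
    """Launch one job per wiki, waiting for each launcher to exit first."""
    for wiki in wikis:
        # subprocess.run blocks until the child process exits. With --follow
        # the launcher exits only when the job does, so jobs never overlap.
        runner(build_command(wiki), check=True)
```

Dropping `--follow` from `build_command` would make each call return as soon as the job is submitted, so the same loop would submit every job back to back.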
-- Alexandros Kosiaris, Principal Site Reliability Engineer, Wikimedia Foundation
Should have sent this earlier, but here we go https://wikitech.wikimedia.org/wiki/Incidents/2024-10-09_calico-codfw
On 26/9/24 13:10, Reuven Lazarus wrote:
Starting a maintenance script looks like this:
rzl@deploy2002:~$ mwscript-k8s --comment="T341553" -- Version.php --wiki=enwiki
Any options for the mwscript-k8s tool, as described below, go before the --.
After the --, the first argument is the script name; everything else is passed to the script. This is the same as you're used to passing to mwscript.
Is that a limitation of Python's command line parsing?
I mean, the obvious way to do it from the viewpoint of usability is to take options after the first argument as belonging to the script.
-- Tim Starling
No, this was a design choice. There are a few different ways to set up the interface, but each has different usability drawbacks.
(This has the potential to consume the thread; I'd prefer if it didn't, even though I appreciate the feedback! Happy to chat more about it off-list, or discuss a feature request as a subtask of T341553 https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?parent=341553&template=341553&status=open&subscribers= .)
The usability problem with the current approach is the extra --. It requires extra space and typing, it's easy to forget, and it's an extra thing for new users to learn. That's especially salient right now as everyone's a new user.
The usability problem with splitting at the script name is that it makes the parsing of non-positional arguments sensitive to their order, which is surprising. That is,
$ mwscript-k8s --verbose script.php
$ mwscript-k8s script.php --verbose
would mean different things, which isn't what that option syntax usually indicates. (In the first, --verbose modifies the behavior of the mwscript-k8s wrapper, printing extra information about the process of launching the Kubernetes job. In the second, it modifies the behavior of the maintenance script.)
When you're lucky, one will work and the other unexpectedly won't; when you're unlucky, they'll both work but the behavior is unexpected. A similar issue already surprises people with mwscript as it exists today (most recently six days ago, https://phabricator.wikimedia.org/T372337#10231006). mwscript-k8s doesn't change that, since it's a function of MaintenanceRunner.php -- and shouldn't change it, since scripts like dumpBackup.php are sensitive to the order -- but this was an intentional decision not to add a second layer of order sensitivity. Instead we're sensitive to "before or after the separator," which is explicit, visually distinctive, and a familiar standard.
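As a rough illustration of the "before or after the separator" rule, the following Python sketch splits an argument list at the first `--` and parses only the left-hand side as wrapper options. This is not the actual mwscript-k8s source: the function name `split_at_separator` and the parser setup are hypothetical, and only `--verbose` and `--comment` are modelled.

```python
import argparse


def split_at_separator(argv):
    """Split argv into (wrapper options, script name, script arguments)."""
    if "--" not in argv:
        raise ValueError("expected a '--' before the script name")
    cut = argv.index("--")
    wrapper_argv, script_argv = argv[:cut], argv[cut + 1:]
    if not script_argv:
        raise ValueError("expected a script name after '--'")

    # Only the arguments to the left of the separator are parsed here.
    parser = argparse.ArgumentParser(prog="wrapper-sketch")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--comment")
    wrapper_opts = parser.parse_args(wrapper_argv)

    # Everything to the right is handed to the script untouched.
    return wrapper_opts, script_argv[0], script_argv[1:]
```

With this split, a `--verbose` placed before the separator only ever reaches the wrapper, and one placed after it only ever reaches the script, so which layer an option belongs to depends on the explicit separator and not on its position relative to the script name.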
Other interface options (wrap the script and its args in quotation marks; let mwscript-k8s consume any options known to it, and pass the rest through to the maintenance script; use no command-line options at all for mwscript-k8s, just control it with environment variables) were rejected for other usability problems. (Most of the problems are self-evident, like being annoying to type. The consume-known-options approach is attractive when everything works well, but it has problems when names collide, like if both mwscript-k8s and the script have a --verbose; it also handles typos ungracefully.)
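The two problems with the consume-known-options approach can be reproduced with Python's `argparse.parse_known_args`, which parses the options it recognizes and returns the rest. The sketch below is hypothetical and is not how mwscript-k8s works; the function name `consume_known_options` is made up, and `--verbose` and `--follow` stand in for wrapper options.

```python
import argparse


def consume_known_options(argv):
    """Parse wrapper options wherever they appear; return (options, leftovers)."""
    parser = argparse.ArgumentParser(prog="wrapper-sketch")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--follow", action="store_true")
    # parse_known_args returns unrecognized arguments instead of rejecting them.
    return parser.parse_known_args(argv)
```

Called with `["script.php", "--verbose"]`, the wrapper claims `--verbose` for itself and the script never sees it, which is the name-collision problem. Called with `["--folow", "script.php"]`, the misspelled option is not rejected; it is passed along with the script's arguments while `follow` stays off, which is the typo problem.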
Python's built-in parsing isn't really a limiting factor, we can do whatever we want. After spending time thinking about it, this looked like the best choice, but I'm open to other ideas. :)