Hi all,
Starting Nov 7, a number of the jobs I would run through Toolforge grid have stopped working. Each job consists of a .sh file like this https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/grid/jobs/daily.sh on the first line of which I use the source command to activate a python virtual environment. When I run source by hand, subsequent lines work. But when I call the .sh file and it tries to run the source command, I get a "source: not found" message, the virtual environment does not get activated and indeed running *which python* returns */usr/bin/python* which is bad. All my scripts depend on pip packages that are installed in the virtual env and not available with the system python.
The main thing I did on Nov 7 was to add a line at the end of my tool account's .bash_profile as below:
exec zsh
This is because when I manually log into toolforge, I would like zsh to be my shell, and since tool accounts don't support chsh, I thought executing zsh directly from bash would be okay. But apparently, that now breaks the source command somehow.
So I wonder:
(a) Is there a way to properly change the default shell of tool accounts?
(b) Is there a way to make *source* work under zsh?
Importantly, I know the problem is with *exec zsh* because once I removed it and logged out and back in, all scripts worked correctly.
Thanks, Huji
Instead of activating the virtual environment, you can call the Python executable directly by its path, i.e. `venv/bin/python yourscript.py`. This works largely the same as activating the venv, except that you always have to make sure you call the right executable yourself; the shell won't figure it out for you.
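A minimal sketch of that approach, assuming the venv lives in `$HOME/venv` (a hypothetical location; set `VENV_DIR` to wherever yours is) and that `yourscript.py` is the script to run:

```shell
#!/bin/sh
# Assumption: the virtual environment is in $HOME/venv; override VENV_DIR if not.
VENV_DIR="${VENV_DIR:-$HOME/venv}"
PYTHON="$VENV_DIR/bin/python"

if [ -x "$PYTHON" ]; then
    # Calling the venv's interpreter by path picks up the venv's packages
    # without needing `source` or `.` to activate it first.
    "$PYTHON" yourscript.py
else
    echo "no executable interpreter at $PYTHON" >&2
fi
```

Because nothing is sourced, this behaves the same whether the job ends up being run by bash, zsh, or dash.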
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
I like that idea! Will try it.
In the meantime, I am still eager to know why *source* is failing (only with zsh and only if called from within a .sh script). I would also like to know the proper way to change the shell for tool accounts.
The standard way to change your login shell is with "chsh". In theory you should be able to do:
$ chsh -s /usr/bin/zsh
and the next time you log in, that should be your shell. See "man chsh" for more details. I say "in theory" because I've never tried it on Toolforge. I can't see any reason why it shouldn't work, but....
Well, chsh would not work here: when you run it, it asks for a password, and tool accounts (the ones you log in to with the "become" command) have no password.
How are you calling the .sh file? It doesn't have a shebang, so I'm guessing you are running `sh daily.sh`. In that case, invoke it with `bash`/`zsh` instead of `sh`, since `sh` is dash, which does not have the `source` command. Alternatively, you can use a dot instead of `source`; the dot command is part of POSIX [1] and is implemented in dash.
`source` is also a zsh builtin [2], so I have no idea how it's breaking for you. Somehow your scripts are being run by dash when you have `exec zsh`, but since I don't know how you are invoking the scripts, I cannot trace the code.
YiFei Zhu
[1] https://man7.org/linux/man-pages/man1/dot.1p.html
[2] https://sourceforge.net/p/zsh/code/ci/master/tree/Src/builtin.c#l116
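To illustrate the dot command, here is a small self-contained sketch. The temporary file is a stand-in for a real `venv/bin/activate`, and `DEMO_VAR` is a made-up name used only for this demonstration:

```shell
#!/bin/sh
# Stand-in for an activate script: a file that sets a variable when sourced.
tmp_env=$(mktemp)
printf 'DEMO_VAR=activated\n' > "$tmp_env"

# `.` is the POSIX spelling of `source`; it works in dash, bash, and zsh.
. "$tmp_env"

echo "DEMO_VAR is now: $DEMO_VAR"
rm -f "$tmp_env"
```

In a job script, the equivalent change would be replacing the leading `source` with `.` and leaving the rest of the line untouched.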
When I can't figure out WTF is going on with a shell script, I put something like this in the front of it:
/usr/bin/env > /tmp/this-is-my-environment
No matter how screwed up your env, path, choice of shell, output redirection, etc are, that's pretty much guaranteed to dump some useful information into someplace where you can find it. The likely suspects are $SHELL, $PATH, $PWD, $USER.
In truly bizarre cases, there might be some permission problem, which you can solve with:
touch /tmp/this-is-my-environment
chmod 0666 /tmp/this-is-my-environment
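A slightly extended sketch of the same trick, meant to go at the top of a job script. Besides the environment, it records how the script was started. The `/proc` lookup is an assumption that only holds on Linux, and `DEBUG_LOG` is a hypothetical variable name:

```shell
#!/bin/sh
# Hypothetical log location; any path the job can write to will do.
DEBUG_LOG="${DEBUG_LOG:-/tmp/this-is-my-environment}"

{
    echo "script: $0"
    echo "pid: $$, parent pid: $PPID"
    # On Linux, /proc/$$/cmdline holds the NUL-separated command line
    # of the shell that is actually interpreting this script.
    if [ -r "/proc/$$/cmdline" ]; then
        printf 'interpreter: '
        tr '\0' ' ' < "/proc/$$/cmdline"
        echo
    fi
    /usr/bin/env
} > "$DEBUG_LOG" 2>&1
```

The `interpreter:` line is the quickest way to see whether the script ended up under `sh`, `bash`, or something else.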
Those are both great ideas! Let me explore more.
Just briefly to respond to YiFei: I call the scripts from cron jobs (see here https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/grid/crontab.backup), which makes me think your intuition is right and they are being called by sh and not by my shell. Is there a way that I can tell, from within a script, how it was called?
Actually, now that I think about it more, in a remote execution environment it's possible that by the time the script gets run, it's not even running on the same machine, which means it'll have its own /tmp. If that's the case, then my next thought is an NFS path to your home directory, which is valid on any possible execution machine. And pre-creating the file with mode 0666 becomes more likely to be necessary.
I had the chance to try out these options.
*YiFei* was right in that the scripts, when invoked by crontab after zsh was activated, were being invoked by sh and that was why the *source* command was not working.
Using . (dot) would resolve the issue on command line, but I cannot tell jsub to use dot. My current crontab entries look like this:
jsub -N "h" -once -o ~/err/hourly.out -e ~/err/hourly.err ~/grid/jobs/hourly.sh
But neither of these is allowed:
jsub -N "h" -once -o ~/err/hourly.out -e ~/err/hourly.err . ~/grid/jobs/hourly.sh
jsub -N "h" -once -o ~/err/hourly.out -e ~/err/hourly.err ". ~/grid/jobs/hourly.sh"
The first one causes jsub to find two arguments and crash; the second one causes jsub to complain that ". ~/grid/jobs/hourly.sh" is not a program.
*Roy Smith* offered a trick to explore what environment was used when my script was run. So I created a small script with the content "/usr/bin/env > ~/myenv.jsub.txt" and then asked jsub to run it. The job was successfully submitted, but the myenv.jsub.txt file remained empty! It is as if *no* environment was passed, which is confusing.
Anyway, for now, I have given up on `exec zsh` and am going to just invoke zsh manually after I run `become`.
Apologies, I missed the emails.
By "." I mean to use it instead of "source". i.e. replace the "source" with "." in the scripts such as daily.sh [1], so that it looks like ". ~/venv/bin/activate".
[1] https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/grid/jobs/...
I did a bit of testing, and if the script does not have a shebang and is not an ELF executable, it will be executed as "-bash -c /path/to/scriptfile.sh". Since you have "exec zsh" in your .bash_profile, bash, started as a login shell, will run it at startup and in theory immediately replace itself with zsh with no arguments. zsh will then see that it has no arguments, attempt to read a script from stdin, get nothing, and immediately exit, stopping the job on the grid. Where dash gets involved in this process I do not know.
Honestly, I think you should not depend on the behavior of shebang-less scripts as the executable. You should either put "bash /path/to/scriptfile.sh" or add a shebang to top of the script.
YiFei Zhu
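As a sketch of what an explicit job script could look like: a shebang on the first line, a check of which shell is really interpreting the file, and the portable `.` for activation. The `$HOME/venv` path and the `ACTIVATE` variable are assumptions for illustration, not taken from the actual repository:

```shell
#!/bin/bash
# The shebang above only takes effect when the file is executed directly;
# running it as `sh script.sh` would still use sh.
if [ -n "${BASH_VERSION:-}" ]; then
    interpreter="bash $BASH_VERSION"
else
    interpreter="not bash"
fi
echo "running under: $interpreter"

# Hypothetical venv location, mirroring ". ~/venv/bin/activate" from above.
ACTIVATE="${ACTIVATE:-$HOME/venv/bin/activate}"
if [ -r "$ACTIVATE" ]; then
    . "$ACTIVATE"   # the dot command itself exists in bash, zsh, and dash
else
    echo "cannot read $ACTIVATE" >&2
fi
```

The other option mentioned above, passing "bash /path/to/scriptfile.sh" as the command, makes the interpreter explicit at submission time instead.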
This is really good advice. Any time you've got a process being run by some tool on your behalf (cron, initd, remote job execution, etc), you're running in an alien environment. You get so used to things "just working" when you run them interactively, you forget how much of your carefully crafted login environment you're depending on. The more you make things totally explicit, the less chance there is for things to go south in a hard-to-debug way.
Again, good advice by both of you! Let me explore more and get back with any potential questions.
I went back and reactivated the line in .bash_profile which enabled zsh ("exec zsh" as the last line of .bash_profile)
Then I submitted the job to the grid, using a command like this:
jsub -N "n" -once -o ~/err/nightly.out -e ~/err/nightly.err ~/grid/jobs/nightly.sh
I did it three ways. First, I used the nightly.sh file as is (see source https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/grid/jobs/nightly.sh). Second, I replaced "source" with "." and third I replaced "source" with "bash". In all three cases, it failed, without even producing an output or error. The nightly.out and nightly.err files were created of course, but were empty.
Next, I added a "#!/bin/bash" shebang and ran it again all three ways. The result was the same.
Running qstat many times shows that the job gets into a queued state ("qw") and after a few seconds, it goes into the run state ("r") and immediately stops.
Removing the "exec zsh" command from .bash_profile will make things work again.
Finally, I decided maybe the problem is that zsh is available for me, but not on the grid. So I changed the end of .bash_profile from a single "exec zsh" command to this:
if [ -f /usr/bin/zsh ]; then
    zsh
fi
Under this config, jobs on the grid worked, and when I used "become" to login as my tool, I ended with zsh. Obviously, I am happy with this workaround. But I am still curious as to the root cause.
Is it really that zsh is not available on the grid, and the grid tries to replicate my environment first and reaches the "exec zsh" command and falls apart somehow?
Submit a job that does:
ls -l /usr/bin/zsh
and see what it says.
It shows the "correct" answer, i.e.:
lrwxrwxrwx 1 root root 8 Dec 18 2019 /usr/bin/zsh -> /bin/zsh
So zsh is there and accessible on the grid. It is still unclear why the scripts would fail on the grid, and why they would fail without outputting anything.
Actually, all that proves is that the symlink exists. The linked-to target (/bin/zsh) may not.
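One way to check the target rather than the link is `[ -x path ]`, which follows symlinks. A small sketch, using a deliberately dangling link in a temporary directory so it can be run anywhere; `check_executable` is a hypothetical helper name:

```shell
#!/bin/sh
check_executable() {
    # -x follows symlinks, so a link whose target is missing fails here
    # even though `ls -l` on the link itself looks fine.
    if [ -x "$1" ]; then
        echo "$1: ok"
    else
        echo "$1: missing or not executable"
        return 1
    fi
}

# Self-contained demonstration: a symlink pointing at nothing.
demo_dir=$(mktemp -d)
ln -s "$demo_dir/does-not-exist" "$demo_dir/dangling"

good=$(check_executable /bin/sh) || true
bad=$(check_executable "$demo_dir/dangling") || true
echo "$good"
echo "$bad"

rm -rf "$demo_dir"
```

In a grid job, the call of interest would be `check_executable /usr/bin/zsh`.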
If I may be so bold to ask, why are you trying to use zsh for scripts? Use whatever you want for interactive work, but for scripts, especially scripts to be run in magical environments like on the grid, what you want to be doing is keeping everything as standardized as possible, and that means bash. There is obviously something in the environment that's not what you expect (I'm using "environment" in the broader sense to include file system layout, permissions, default i/o streams, etc). You're not even able to capture stdout or stderr to someplace you can see it. This makes debugging extremely difficult. It's like trying to do brain surgery blindfolded and wearing boxing gloves.
I've gone down this rabbit hole before and my advice is to not go there. Just use bash. Keep everything as simple as possible.
On Tue, Nov 16, 2021 at 6:38 PM Huji Lee huji.huji@gmail.com wrote:
Is it really that zsh is not available on the grid, and the grid tries to replicate my environment first and reaches the "exec zsh" command and falls apart somehow?
This is consistent with what I described earlier:
Since you have "exec zsh" in your .bash_profile, bash, started as a login shell, will run it at startup and in theory immediately replace itself with zsh with no arguments. zsh will then see that it has no arguments, attempt to read a script from stdin, get nothing, and immediately exit, stopping the job on the grid.
However, now that you have "zsh" instead of "exec zsh", the "replace" is not done. bash as the login shell executes zsh as a subshell, and zsh, having no inputs, immediately exits. The execution continues as if nothing had ever happened.
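The difference between the two forms can be reproduced without zsh. In this sketch `true` stands in for a zsh that exits immediately because it has no input, and the `echo` stands in for the rest of the job:

```shell
#!/bin/sh
# With exec, the shell is replaced by `true`; nothing after it ever runs.
with_exec=$(sh -c 'exec true; echo "still running"')

# Without exec, `true` runs as a child and the shell carries on afterwards.
without_exec=$(sh -c 'true; echo "still running"')

echo "with exec:    '$with_exec'"
echo "without exec: '$without_exec'"
```

The first variable comes back empty, which matches the jobs that stopped without producing any output.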
I just tested how bash invokes .bash_profile by adding a sleep 60 to .bash_profile and giving my test.sh a shebang. I submitted a job both with an explicit 'bash' and without, and it looks like .bash_profile is executed in both cases:
USER       PID %CPU %MEM    VSZ   RSS TTY STAT START    TIME COMMAND
sgeadmin   762  0.4  0.1 111020 16056 ?   Sl   Mar25 1383:08 /usr/lib/gridengine/sge_execd
[...]
sgeadmin 20388  0.0  0.1  51468  8540 ?   S    07:57    0:00  \_ /usr/lib/gridengine/sge_shepherd -bg
tools.z+ 20390  0.0  0.0  23580  3196 ?   Ss   07:57    0:00      \_ -bash -c /data/project/zhuyifei1999-test/test.sh
tools.z+ 20393  0.0  0.0   5796   672 ?   S    07:57    0:00          \_ sleep 60

USER       PID %CPU %MEM    VSZ   RSS TTY STAT START    TIME COMMAND
sgeadmin   752  0.3  0.1 115112 16100 ?   Sl   Mar25 1313:16 /usr/lib/gridengine/sge_execd
[...]
sgeadmin  8715  0.0  0.1  51468  8688 ?   S    07:57    0:00  \_ /usr/lib/gridengine/sge_shepherd -bg
tools.z+  8717  0.0  0.0  23580  3324 ?   Ss   07:57    0:00      \_ -bash -c /bin/bash /data/project/zhuyifei1999-test/test.sh
tools.z+  8720  0.0  0.0   5796   656 ?   S    07:57    0:00          \_ sleep 60
It did take me by surprise that it's still bash that invokes the given command, because bash was not in the process tree for a usual "jsub [...] python script.sh". For example, a non-continuous job typically looks like this:
sgeadmin 28386  0.0  0.1  51468   8588 ?   S    Nov15   0:00  \_ /usr/lib/gridengine/sge_shepherd -bg
tools.f+ 28388  7.2  3.5 427144 293024 ?   Ss   Nov15 210:55  |   \_ /usr/bin/python pycore/pwb.py pycore/fawikibot/rade.py -newcat:10
And a continuous one:
sgeadmin  3699  0.0  0.0  51464   4540 ?   S    Apr19   0:00  \_ /usr/lib/gridengine/sge_shepherd -bg
tools.b+  3701  0.0  0.0   4280     68 ?   SNs  Apr19   0:00  |   \_ /bin/sh /var/spool/gridengine/execd/tools-sgeexec-0942/job_scripts/1302451
tools.b+  3702  0.2  2.8 505104 231092 ?   SNl  Apr19 674:45  |       \_ /usr/bin/python bot2.py
There is no `-bash -c "python script.sh"`.
However, if you trace what's going on, for a non-interactive bash that only receives a single command, it will directly execve that command:
$ strace -e clone,execve bash -c '/bin/true'
execve("/bin/bash", ["bash", "-c", "/bin/true"], [/* 26 vars */]) = 0
execve("/bin/true", ["/bin/true"], [/* 25 vars */]) = 0
+++ exited with 0 +++
It does not involve the child processes you'd expect from the fork-exec model. Therefore, we can say that no matter what you do with the job submission, a non-interactive bash login shell will be executed to run the command you specified to jsub, and the mess of "bash replaces itself with zsh, which immediately exits because stdin is empty" will apply.
I think it is important to clarify that a shell like bash has 4 modes of execution, defined by whether it is an interactive shell and whether it is a login shell. You can find the details of these modes for bash in its man page [1]. But tl;dr:
Login shells:
- Upon startup, sources /etc/profile, then the first one that exists among ~/.bash_profile, ~/.bash_login, and ~/.profile.
- `bash -l` and `-bash` (note the dash sign at the front) make bash a login shell.
Non-login shells:
- If also interactive, upon startup, sources ~/.bashrc.
Interactive shells:
- Displays a prompt for each command.
Non-interactive shells:
- Upon startup, sources $BASH_ENV if it exists.
- As we saw above, if the command is given in the command string in -c and there is only one command, bash does not fork-exec the command but execs the command directly.
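If you want to check which mode a given bash is in, something like the following may help. It is a sketch, not part of the original test: the `probe` string and the variable names are invented for the example, and the login shell is given an empty temporary HOME so that no personal profile gets sourced.

```shell
# Probe that reports which of the four modes the bash running it is in.
probe='shopt -q login_shell && login=login || login=non-login
case $- in *i*) inter=interactive ;; *) inter=non-interactive ;; esac
echo "$login, $inter"'

# Empty HOME so that the login shell finds no ~/.bash_profile or ~/.profile.
empty_home=$(mktemp -d)

plain_mode=$(bash -c "$probe")
login_mode=$(HOME="$empty_home" bash -l -c "$probe")

echo "bash -c:    $plain_mode"
echo "bash -l -c: $login_mode"
rmdir "$empty_home"
```

Running the same probe from an interactive prompt (for example by pasting it after `become tool`) would be the way to see the interactive variants.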
So you might wonder why there is a separation between login shells (profile) and non-login shells (rc). The reason is that some parts of the environment are inherited by subshells while others are not. Environment variables are inherited:
$ export FOO=bar
$ echo $FOO
bar
$ bash
$ echo $FOO
bar
While things like aliases are not:
$ alias foo='echo bar'
$ foo
bar
$ bash
$ foo
bash: foo: command not found
There are environment setups that get inherited, but that you do not want to be executed over and over by subshells. For example, appending to $PATH (`export PATH="$PATH:/path/to/bin"`). If it is in rc instead of profile, every time you run an interactive bash subshell, PATH gets longer and more redundant; hence $PATH setups normally go in profile instead of rc. Non-inheritable setups like aliases go in rc. And the separation between .bash_profile and .profile is just so that you can have a .bash_profile that uses bash-specific syntax. I never needed any, so I always use .profile.
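The PATH growth can be reproduced without touching your real startup files. In this sketch a temporary file plays the role of the rc file, and $BASH_ENV is used to make each nested non-interactive bash source it, as a stand-in for ~/.bashrc being sourced by each interactive subshell; the file and variable names are assumptions made for the example.

```shell
# Hypothetical rc file that appends to PATH every time it is sourced.
rc_file=$(mktemp)
echo 'export PATH="$PATH:/path/to/bin"' > "$rc_file"

# BASH_ENV makes each non-interactive bash source rc_file at startup,
# so the outer and the inner bash both append to PATH.
nested_path=$(BASH_ENV="$rc_file" bash -c 'bash -c "echo \$PATH"')

# Count how many times the directory was appended (once per nesting level).
copies=$(echo "$nested_path" | tr ':' '\n' | grep -c -x '/path/to/bin')
echo "copies of /path/to/bin: $copies"
rm -f "$rc_file"
```

With two levels of nesting, the directory should show up twice, and each further level would add another copy.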
And to have bash login shells also get the initialization from rc, .profile usually has a header like this:
# if running bash
if [ -n "$BASH_VERSION" ]; then
    # include .bashrc if it exists
    if [ -f "$HOME/.bashrc" ]; then
        . "$HOME/.bashrc"
    fi
fi
And .bashrc:
# Test for an interactive shell
if [[ $- != *i* ]] ; then
    # Shell is non-interactive. Be done now!
    return
fi
I hope this makes sense. Let me know if not.
Back to your question, let's see in what scenarios you would want to invoke zsh:
- Non-interactive shells: No, you don't want `bash command.sh` to randomly exec zsh.
- Interactive non-login shells: No, if you explicitly run `bash`, you want bash, not zsh.
- Interactive login shells: Yes, this is what `become tool` runs initially, and you want zsh here.
Hence, to run in a login shell environment you'd want .profile or .bash_profile. An interactive guard is simply [[ $- = *i* ]] in bash syntax, so what you want, expressed in code, is this in .bash_profile:
if [[ $- = *i* ]]; then
    exec zsh
fi
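To convince yourself that the guard protects grid jobs, you could emulate the grid's non-interactive login shell locally. The following is a rough sketch under a few assumptions: two throwaway HOME directories hold the two variants of .bash_profile, `bash -l -c` approximates the `-bash -c <command>` seen in the process tree, and `exec true` stands in for `exec zsh` in the unguarded variant so that zsh does not need to be installed.

```shell
guarded_home=$(mktemp -d)
unguarded_home=$(mktemp -d)

# Guarded profile: only hand over to zsh when the shell is interactive.
cat > "$guarded_home/.bash_profile" <<'EOF'
if [[ $- = *i* ]]; then
    exec zsh
fi
EOF

# Unguarded profile; "exec true" stands in for "exec zsh".
echo 'exec true' > "$unguarded_home/.bash_profile"

# Non-interactive login shells, each running a one-line "job".
guarded_out=$(HOME="$guarded_home" bash -l -c 'echo job ran')
unguarded_out=$(HOME="$unguarded_home" bash -l -c 'echo job ran')

echo "guarded:   '$guarded_out'"
echo "unguarded: '$unguarded_out'"
rm -rf "$guarded_home" "$unguarded_home"
```

The guarded variant lets the job's command run, while the unguarded one replaces bash before the command is ever reached.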
As a side note, yes zsh exists on the grid hosts:
zhuyifei1999@tools-sgeexec-0901: ~$ ls -l {/usr,}/bin/zsh
-rwxr-xr-x 1 root root 819744 Dec 1 2020 /bin/zsh
lrwxrwxrwx 1 root root 8 Nov 22 2018 /usr/bin/zsh -> /bin/zsh
[1] https://man7.org/linux/man-pages/man1/bash.1.html#INVOCATION
YiFei Zhu
On Wed, Nov 17, 2021 at 1:04 AM YiFei Zhu zhuyifei1999@gmail.com wrote:
[...]
Have you had a chance to take a look at it yet?
YiFei Zhu
I had to read through your email a few times to fully understand it. You provided lots of useful information; thank you!
I tried changing the code in my .bash_profile to what you suggested; after logging out and logging back in, zsh was my shell in interactive mode. I then submitted a job via jsub and that also seemed to work correctly. In short, it seems like what you suggested takes care of my problem. I will let you know if I find any evidence otherwise.
On Tue, Nov 23, 2021 at 12:06 PM YiFei Zhu zhuyifei1999@gmail.com wrote:
Have you had a chance to take a look at it yet?
YiFei Zhu
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
On Thu, Dec 2, 2021 at 7:14 PM Huji Lee huji.huji@gmail.com wrote:
I had to read through your email a few times to fully understand it. You provided lots of useful information; thank you!
Yep, things like terminals and shells are more complicated than they look ;)
I tried changing the code in my .bash_profile to what you suggested; after logging out and logging back in, zsh was my shell in interactive mode. I then submitted a job via jsub and that also seemed to work correctly. In short, it seems like what you suggested takes care of my problem. I will let you know if I find any evidence otherwise.
Sounds good to me
YiFei Zhu