Dr. Trigon wrote:
On 03.03.2012 22:46, Merlissimo wrote:
Hello toolserver users,
[...]
First, thanks a lot for your effort!! At first I thought: why make it even more complicated (this increases the possibility of misconfiguration)? But after setting up my cron(ie)tab I have to say it makes sense! It will be a good thing, even though not a simple one.
-- Example:
This morning Dr. Trigon complained that his job "mainbot" did not run immediately and was queued for a long time. I would guess he submitted his job from cron using "cronsub mainbot -l /home/drtrigon/pywikipedia/mainbot.py". This indicates that the job runs forever (long-running) with unknown memory usage, so grid engine was only able to start this job on willow. It is not possible to run infinite jobs on the webservers (only shorter jobs are allowed there, so that most jobs have finished before high webserver usage is expected in the evening). Nor was it possible to run it on the server running the mail transfer agent, which has less than 500MB of memory free but much CPU power (and the job's expected memory usage is unknown). Other servers like nightshade and yarrow aren't currently available.
Thanks for taking me as an example - that helps a lot... ;))
The exact command was: cronsub -sl mainbot $HOME/pywikipedia/bot_control.py -default -cron (very close... ;)
According to its last run, this job takes about 2 hours and 30 minutes of runtime and had a peak usage of 370 MB of memory. I got these values by asking grid engine for usage statistics of the last ten days: "qacct -j mainbot -d 10". To be sure that the job always gets enough resources, I would suggest raising the values to 4 hours and 500MB of memory. It is not a problem to request more resources than really needed, but jobs needing more resources than requested may be killed. So the new submit command would be:
"qcronsub -N mainbot -l h_rt=4:00:00 -l virtual_free=500MB /home/drtrigon/pywikipedia/mainbot.py"
I now use: qcronsub -l h_rt=12:00:00 -l virtual_free=500M -m a -b y -N mainbot $HOME/pywikipedia/bot_control.py -default -cron (a little more time, plus '-m a -b y')
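The padding rule described above (measure with qacct, then request a safety margin) can be sketched as a tiny helper. The function name `suggest_limits` and the ~35% padding factor are my own illustration, not a toolserver tool; the 370 MB / 2h30m figures are the ones reported for mainbot.

```shell
# Sketch: turn a job's observed peak usage (from "qacct -j <name> -d 10")
# into padded qcronsub resource requests. suggest_limits and the padding
# factor are illustrative assumptions, not part of the toolserver tooling.
suggest_limits() {
    peak_mb=$1     # observed peak memory in MB
    peak_hours=$2  # observed runtime, rounded up to whole hours
    # pad memory by ~35% and round up to the next 100 MB; add a spare hour
    req_mb=$(( ( peak_mb * 135 / 100 + 99 ) / 100 * 100 ))
    req_hours=$(( peak_hours + 1 ))
    echo "-l h_rt=${req_hours}:00:00 -l virtual_free=${req_mb}M"
}

# mainbot: 370 MB peak, 2h30m runtime -> the suggested 4h/500M request
suggest_limits 370 3
```

Any rounding scheme works; the point is only that the request should comfortably exceed what qacct reports, since over-requesting is harmless but under-requesting can get the job killed.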
And here my key question arises: you mentioned 'qacct' to get more info (thanks for this hint), and this is one of the biggest problems I had with the whole SGE setup; I was not able to find complete documentation, neither on the toolserver nor elsewhere. At the moment, commands like 'qstat' or 'qdel' are not covered there anymore. I (we) would like to know more about this great system.
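For reference, the everyday client commands are standard Sun Grid Engine tools, so their man pages (qstat(1), qdel(1), qacct(1)) should be available on the login servers even where the toolserver wiki is silent. A short, hedged cheat sheet (job names/ids are placeholders):

```shell
# Standard SGE client commands (consult the man pages for full options):
qstat                      # list your queued and running jobs
qstat -j mainbot           # full status of one job, incl. why it is queued
qdel mainbot               # cancel a job by name or id
qacct -j mainbot -d 10     # usage statistics for runs of the last 10 days
```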
E.g. what is the analogue to the old commands:
'cronsub [jobname] [command]' has become 'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -m a -b y -N [jobname] [command]'
'cronsub -l [jobname] [command]' has become 'qcronsub -l h_rt=INFINITY -l virtual_free=100M -m a -b y -N [jobname] [command]'
In both cases the old behavior was without -m a -b y, so:
'cronsub [jobname] [command]' has become 'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -N [jobname] [command]'
'cronsub -l [jobname] [command]' has become 'qcronsub -l h_rt=INFINITY -l virtual_free=100M -N [jobname] [command]'
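That mapping is mechanical enough to wrap in a personal shell function. This is a sketch under my own assumptions (the name `my_cronsub`, the fixed 100M default, and the fact that it only echoes the command it would run instead of executing it, so it can be inspected safely):

```shell
# Sketch: restore the old "cronsub [-l] jobname command" interface on top
# of qcronsub. my_cronsub is a hypothetical personal wrapper; it echoes
# the built command instead of submitting it, for safe inspection.
my_cronsub() {
    runtime="06:00:00"           # old short-job default
    if [ "$1" = "-l" ]; then     # old "long-running" flag
        runtime="INFINITY"
        shift
    fi
    name=$1
    shift
    echo "qcronsub -l h_rt=$runtime -l virtual_free=100M -N $name $*"
}

my_cronsub -l mainbot /home/drtrigon/pywikipedia/mainbot.py
```

Dropping the leading `echo` would turn it into a real submission wrapper, at which point the 100M default should be raised per job as discussed above.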
The -b y option is mostly useful for binaries, e.g. if you don't submit the python script itself but call the interpreter (python) with the script as an argument. It simply controls whether the submitted script file is copied to a local filesystem on the execution server (which increases performance, makes NFS errors impossible, and was always the default setting) or executed directly from your home directory (if you use -b y). In most cases this option isn't needed, and copying is best for most shell scripts.
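Side by side, the two submission styles look like this (paths are the thread's mainbot example; the resource flags are the ones suggested earlier):

```shell
# Default (no -b y): the submitted script file itself is spooled (copied)
# by grid engine before execution
qcronsub -l h_rt=4:00:00 -l virtual_free=500M -N mainbot /home/drtrigon/pywikipedia/mainbot.py

# With -b y: nothing is copied; the command line runs as-is from your
# home, which also allows "interpreter + script" invocations
qcronsub -l h_rt=4:00:00 -l virtual_free=500M -b y -N mainbot python /home/drtrigon/pywikipedia/mainbot.py
```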
I added this to the interwiki bot example because of the following scenario:
1. you submit a job interwiki.py, so the file is copied to the SGE spool directory and the job is queued;
2. then you update your local svn copy;
3. afterwards the job is started.
Now the copied interwiki.py can be older than the rest of your pwd files. So in this case it's better to use the same version of all pwd files directly from your home directory. It's very unlikely that this problem really happens, but I wanted to write a perfect example.
as far as I can see... (I do not remember what the '-s' was for...) Is this correct?
The -s option of cronsub merged the standard error stream into the standard output stream. This is now -j y. I don't know why river used -s for this (perhaps for "stream"?). Some weeks ago I installed a script that removes empty log files for the standard error/output streams after job execution. Many people used this option to prevent their home directories from filling up with so many empty error logs.
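In plain shell terms, merging the streams is the familiar `2>&1` redirection; this toy demonstration (not a grid engine command, `job` and `merged.log` are made up for illustration) shows the effect -j y has on a job's logs:

```shell
# Toy demonstration of what -j y (the old cronsub -s) means: standard
# error is merged into standard output, so one log file is produced.
job() {
    echo "normal output"
    echo "an error" >&2
}
job > merged.log 2>&1   # same effect as submitting the job with -j y
cat merged.log          # both lines end up in the single log
```

Without the merge, the error line would land in a separate .e log file, which is exactly the source of all those empty error logs the cleanup script removes.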
Thanks for your work Merlissimo and greetings DrTrigon
Merlissimo