Dr. Trigon wrote:
On 03.03.2012 22:46, Merlissimo wrote:
Hello toolserver users,
[...]
First, thanks a lot for your effort!! At first I thought: why make it even
more complicated (this increases the possibility of misconfiguration)? But after
setting up my cron(ie)tab I have to say it makes sense! It will be a good
thing even though it is not simple.
--
Example:
This morning Dr. Trigon complained that his job "mainbot" did not run
immediately and was queued for a long time. I would guess he submitted his job from cron
using "cronsub mainbot -l /home/drtrigon/pywikipedia/mainbot.py".
This indicates that the job runs forever (longrun) with unknown memory usage, so grid
engine was only able to start this job on willow.
It is not possible to run infinite jobs on the webservers (only shorter jobs are allowed
there, so that most jobs have finished before high webserver usage is expected during the
evening). Nor was it possible to run it on the server running the mail transfer agent,
which has less than 500 MB of memory free but plenty of CPU power (the expected memory
usage is unknown). Other servers like nightshade and yarrow aren't currently available.
Thanks for taking me as an example - that helps a lot... ;))
The exact command was:
cronsub -sl mainbot $HOME/pywikipedia/bot_control.py -default -cron
(very close... ;)
According to the last run, this job takes
about 2 hours and 30 minutes of runtime and had a peak memory usage of 370 MB. I got these
values by asking grid engine for the usage statistics of the last ten days: "qacct
-j mainbot -d 10".
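qacct prints one "field value" pair per line, so the two numbers that matter for sizing a job can be pulled out with a tiny filter. This is only a sketch: the helper name is ours, and the sample record in the usage comment is illustrative, not real accounting data ("ru_wallclock" and "maxvmem" are the field names qacct actually reports).

```shell
#!/bin/sh
# Hypothetical helper: extract wallclock seconds and peak memory
# from "qacct -j <job>" style output read on stdin.
summarize_usage() {
    awk '$1 == "ru_wallclock" { wc  = $2 }
         $1 == "maxvmem"      { mem = $2 }
         END { print wc " s wallclock, " mem " peak memory" }'
}
# Typical use on the cluster:
#   qacct -j mainbot -d 10 | summarize_usage
```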
To be sure that the job always gets enough resources I would suggest raising the values
to 4 hours and 500 MB of memory. It is not a problem if you request more resources than
really needed, but a job needing more resources than requested may be killed. So the new
submit command would be:
"qcronsub -N mainbot -l h_rt=4:00:00 -l virtual_free=500MB
/home/drtrigon/pywikipedia/mainbot.py"
I use now:
qcronsub -l h_rt=12:00:00 -l virtual_free=500M -m a -b y -N mainbot
$HOME/pywikipedia/bot_control.py -default -cron
(a little more time, plus '-m a -b y')
And here my key question arises: you mentioned 'qacct' to get more info
(thanks for this hint), and this is one of the biggest problems I had with
the whole SGE stuff; I was not able to find complete documentation, neither on
the toolserver nor elsewhere. At the moment, commands like
'qstat' or 'qdel' are no longer covered in the toolserver documentation.
I (we) would like to know more about this great system.
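For reference, the everyday grid engine commands mentioned in this thread are standard SGE tools with their own man pages (e.g. "man qstat"); these invocations are plain SGE usage, not toolserver-specific, and "mainbot" is just the job name from the example above:

```shell
qstat -u "$USER"         # list your own queued and running jobs
qstat -j mainbot         # detailed status of one job, incl. why it waits
qdel mainbot             # delete a queued or running job by name or id
qacct -j mainbot -d 10   # accounting data for runs of the last 10 days
```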
E.g. what is the analogue to the old commands:
'cronsub [jobname] [command]'
has become
'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -m a -b y -N [jobname]
[command]'
'cronsub -l [jobname] [command]'
has become
'qcronsub -l h_rt=INFINITY -l virtual_free=100M -m a -b y -N [jobname]
[command]'
In both cases the old behavior was without -m a -b y, so
'cronsub [jobname] [command]'
has become
'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -N [jobname] [command]'
'cronsub -l [jobname] [command]'
has become
'qcronsub -l h_rt=INFINITY -l virtual_free=100M -N [jobname] [command]'
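The mapping above can be captured in a small wrapper. This is only a sketch of the translation described in this thread, not an official tool: the function name is ours, and the 6-hour/100M defaults are taken from the examples above. It prints the qcronsub command instead of running it so the mapping stays visible; drop the "echo" to actually submit.

```shell
#!/bin/sh
# Hypothetical compatibility wrapper: translate the old cronsub
# syntax into the equivalent qcronsub invocation described above.
cronsub_compat() {
    runtime="6:00:00"            # old default: jobs limited to 6 hours
    if [ "$1" = "-l" ]; then     # old "long-running" flag
        runtime="INFINITY"
        shift
    fi
    jobname="$1"; shift
    echo qcronsub -l h_rt="$runtime" -l virtual_free=100M \
         -N "$jobname" "$@"
}
```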
The -b y option is mostly useful for binaries, e.g. if you don't submit the python
script itself but call the interpreter binary (python) with the script as an argument.
It simply selects whether the submitted script file is copied to a local filesystem on
the execution server (which increases performance, makes NFS errors impossible and has
always been the default setting) or executed directly from your home directory (if you
use -b y). In most cases this option isn't needed, and copying is best for most shell
scripts.
I added this to the interwiki bot example because:
1: if you submit a job interwiki.py, the file is copied to the SGE spool directory and
the job is queued
2: then you update your local svn copy
3: afterwards the job is started
Now the copied interwiki.py can be older than the rest of your pwd files. So in this case
it's better to use the same version of all pwd files directly from your home directory.
It's very unlikely that this problem really happens, but I wanted to write a perfect
example.
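The race in steps 1-3 can be made concrete with the two submit variants. Both invocations below are sketches assembled from the flags discussed in this thread (the job name "iwbot" is made up); whether a later svn update is picked up depends on the -b setting:

```shell
# -b n (default): interwiki.py is copied into the SGE spool at submit
# time, so an "svn up" between submit and start is NOT picked up.
qcronsub -l h_rt=6:00:00 -l virtual_free=100M -N iwbot \
    $HOME/pywikipedia/interwiki.py

# -b y: the path is resolved at start time and the script runs from
# the home directory, so it always matches the rest of the checkout.
qcronsub -l h_rt=6:00:00 -l virtual_free=100M -b y -N iwbot \
    $HOME/pywikipedia/interwiki.py
```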
as far as I can see... (I do not remember what the
'-s' was for...)
Is this correct?
The -s option of cronsub merges the standard error stream into the standard output
stream. This is now -j y. I don't know why river used -s for this (perhaps for stream?).
Some weeks ago I installed a script that removes empty log files for the standard
error/output streams after job execution. Many people used this option to prevent their
homedir from collecting so many empty error logs.
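Such a cleanup can be sketched with a single find call. The function name and the assumption that the logs sit flat in one directory are ours; the jobname.o<jobid> / jobname.e<jobid> naming is the grid engine default:

```shell
#!/bin/sh
# Sketch: delete zero-byte SGE stdout/stderr logs (jobname.o<id>,
# jobname.e<id>) from the given directory, leaving non-empty logs.
prune_empty_logs() {
    find "$1" -maxdepth 1 -type f -size 0 \
         \( -name '*.o[0-9]*' -o -name '*.e[0-9]*' \) -delete
}
```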
Thanks for your work Merlissimo
and greetings
DrTrigon
Merlissimo