On 03.03.2012 22:46, Merlissimo wrote:
Hello toolserver users,
as you may know, there were some major problems related to Sun Grid
Engine starting in November 2011. I asked DaB. to become an SGE manager
to help solve these problems.
During the last months I quietly started reconfiguring SGE in small
steps, so that it was always possible to use it as before and no
downtime was needed. This took some time because I am only a volunteer
and had to change nearly everything. Additionally, Nosy and DaB. changed
some Solaris configurations that I proposed.
All scripts that used the grid engine before can continue to run without
changes, but you may be able to increase your script's performance by
providing additional information.
In the past you were asked to choose a suitable queue (all.q or longrun)
for your job. Many people chose a queue that did not fit their task
well, so I changed this procedure.
Now you have to declare all resources that your job needs during runtime
at job submission. SGE will then choose the queue and host that best fit
your requirements, so you don't have to care about the different queues
anymore (you may have noticed that there are many more queues than
before).
All jobs must at least declare their maximum runtime (h_rt) and peak
memory usage (virtual_free). This information may become mandatory in
the future; currently only a warning message is shown.
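As a small sketch of such a submission (job name, path, and values here
are purely hypothetical), the two required resources can be declared
like this:

```shell
# Hypothetical job: declare the two resources every job should request,
# maximum runtime (h_rt) and peak memory (virtual_free), so SGE can pick
# a fitting queue and host. Name, values, and path are illustrative only.
JOB_NAME=mybot
RUNTIME=2:00:00          # hard runtime limit: 2 hours
MEMORY=300M              # peak memory estimate: 300 MB
SCRIPT=$HOME/mybot/run.py

CMD="qcronsub -N $JOB_NAME -l h_rt=$RUNTIME -l virtual_free=$MEMORY $SCRIPT"
echo "$CMD"              # printed here instead of executed; run it to submit
```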
You also have to request other resources such as SQL connections, free
temp space, etc. if your job needs them. Please read the documentation
on the Toolserver wiki, which I updated today:
<https://wiki.toolserver.org/view/Job_scheduling>
It currently contains the main information you need to know; I may add
some more examples later.
I have also added a new script called "qcronsub". This is the
replacement for the "cronsub" command most of you used before. Unlike
cronsub, it accepts the same arguments as grid engine's original "qsub"
command, so it is now possible to specify all resource values on the
command line.
Please note that you should always use cronie on submit.toolserver.org
for submitting jobs to SGE via cron. These cron tasks will always be
executed even if one host (e.g. clematis or willow) is down. This has
been the suggested usage for about 17 months. Many people have migrated
their cron jobs from nightshade to willow during the last weeks, but
they will face the same problem again if willow has to be shut down for
a longer time (which hopefully never happens).
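For illustration, a crontab entry on submit.toolserver.org could then
look like this (the schedule, job name, and path are made up):

```
# m  h  dom mon dow  command                 (illustrative crontab entry)
  0  3  *   *   *    qcronsub -N mybot -l h_rt=4:00:00 -l virtual_free=500M $HOME/mybot/run.py
```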
First of all, thanks a lot for your effort!! At first I wondered why
make it even more complicated (this increases the possibility of
misconfiguration), but after setting up my cron(ie)tab I have to say it
makes sense! It will be a good thing, even though not a simple one.
--
Example:
This morning Dr. Trigon complained that his job "mainbot" did not run
immediately and was queued for a long time. I would guess he submitted
his job from cron using
"cronsub mainbot -l /home/drtrigon/pywikipedia/mainbot.py".
This indicates that the job runs forever (longrun) with unknown memory
usage, so the grid engine was only able to start this job on willow.
It is not possible to run infinite jobs on the webservers (only shorter
jobs are allowed there, so that most jobs have finished before high
webserver load is expected in the evening). Nor was it possible to run
it on the server hosting the mail transfer agent, which has less than
500 MB of memory free but plenty of CPU power (and the expected memory
usage was unknown). Other servers like nightshade and yarrow are
currently unavailable.
Thanks for taking me as an example - that helps a lot... ;))
The exact command was:
cronsub -sl mainbot $HOME/pywikipedia/bot_control.py -default -cron
(very close... ;)
According to its last run, the job takes about 2 hours and 30 minutes
and had a peak memory usage of 370 MB. I got these values by asking the
grid engine for usage statistics of the last ten days: "qacct -j mainbot
-d 10".
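To illustrate how such values can be read out of the accounting data,
here is a small sketch; the here-doc stands in for real "qacct -j
mainbot -d 10" output (ru_wallclock and maxvmem are standard SGE
accounting fields, but the numbers are invented to match the figures
quoted above):

```shell
# Sketch: derive h_rt/virtual_free requests from SGE accounting data.
# The here-doc below is a stand-in for real qacct output.
qacct_output() {
cat <<'EOF'
ru_wallclock 9000
maxvmem      370.000M
EOF
}

# Real use would pipe the actual command: qacct -j mainbot -d 10 | awk ...
qacct_output | awk '
    /^ru_wallclock/ { wc  = $2 }    # runtime in seconds
    /^maxvmem/      { mem = $2 }    # peak virtual memory
    END { printf "runtime: %dh%02dm, peak mem: %s\n",
                 wc/3600, (wc%3600)/60, mem }'
```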
To be sure that the job always gets enough resources, I would suggest
raising the values to 4 hours and 500 MB of memory. It is not a problem
to request more resources than really needed, but a job needing more
resources than requested may be killed. So the new submit command would
be:
"qcronsub -N mainbot -l h_rt=4:00:00 -l virtual_free=500MB
/home/drtrigon/pywikipedia/mainbot.py"
I now use:
qcronsub -l h_rt=12:00:00 -l virtual_free=500M -m a -b y -N mainbot
$HOME/pywikipedia/bot_control.py -default -cron
(a little more time, plus '-m a -b y')
And here my key question arises: you mentioned 'qacct' to get more
information (thanks for this hint), and this touches one of the biggest
problems I had with the whole SGE stuff; I was not able to find complete
documentation, neither on the toolserver nor elsewhere. At the moment,
commands like 'qstat' or 'qdel' are no longer covered on the toolserver.
I (we) would like to know more about this great system.
E.g. what are the analogues to the old commands:
'cronsub [jobname] [command]'
has become
'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -m a -b y -N [jobname]
[command]'
'cronsub -l [jobname] [command]'
has become
'qcronsub -l h_rt=INFINITY -l virtual_free=100M -m a -b y -N [jobname]
[command]'
as far as I can see... (I do not remember what the '-s' was for...)
Is this correct?
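Assuming that mapping is right, the old cronsub interface could even be
emulated with a small shell wrapper. This is purely a sketch, not an
official tool; the defaults 06:00:00, INFINITY, and 100M are just the
guesses from the mapping above:

```shell
# Hypothetical wrapper emulating the old 'cronsub [-l] jobname command...'
# interface on top of qcronsub, using the guessed mapping above.
cronsub_compat() {
    runtime=06:00:00                # default for a finite job
    if [ "$1" = "-l" ]; then        # old 'longrun' flag -> unlimited runtime
        runtime=INFINITY
        shift
    fi
    jobname=$1
    shift
    qcronsub -l h_rt=$runtime -l virtual_free=100M -m a -b y \
        -N "$jobname" "$@"
}
```

With that, 'cronsub_compat -l mainbot $HOME/pywikipedia/bot_control.py
-default -cron' would expand to roughly the long form used above.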
Thanks for your work Merlissimo
and greetings
DrTrigon