On 03.03.2012 22:46, Merlissimo wrote:
Hello toolserver users,
as you may know, there were some bigger problems related to Sun Grid Engine starting in November 2011. I asked DaB. to become an SGE manager to help solve these problems. During the last months I quietly started reconfiguring SGE in small steps, so that it was always possible to use it as before and no downtime was needed. This took some time because I am only a volunteer and I had to change nearly everything. Additionally, Nosy and DaB. changed some Solaris configurations that I proposed.
All scripts that used the grid engine before can continue to run without changes, but you may be able to increase your script's performance by providing additional information.
In the past you were asked to choose a suitable queue (all.q or longrun) for your job. Many people chose a queue that did not fit their task well, so I changed this procedure.
Now you declare all resources that your job needs during runtime when you submit it. SGE will then choose the queue and host that best fit your requirements, so you no longer have to care about the different queues (you may have noticed that there are many more queues than before).
All jobs should at least declare their maximum runtime (h_rt) and peak memory usage (virtual_free). This information may become mandatory in the future; currently only a warning message is shown. You also have to request other resources, like SQL connections or free temp space, if your job needs them. Please read the documentation on the Toolserver wiki, which I updated today: https://wiki.toolserver.org/view/Job_scheduling It currently contains the main information you need to know; maybe I will add some more examples later.
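As a rough sketch, a submission declaring runtime and memory could look like the following. Only h_rt and virtual_free are named above; the job name and script path are made up for illustration, and any further resource names must be looked up on the wiki page:

```shell
# Declare a 2-hour runtime limit and 256 MB peak memory on submission.
# Job name and script path are illustrative, not real Toolserver paths.
qsub -N examplejob \
     -l h_rt=2:00:00 \
     -l virtual_free=256M \
     $HOME/bin/examplejob.sh
```

SGE then schedules the job on any host and queue whose limits are large enough for the requested resources.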
I have also added a new script called "qcronsub". This is the replacement for the "cronsub" most of you used before. Unlike cronsub, it accepts the same arguments as the original "qsub" command from grid engine, so it is now possible to pass all resource values on the command line.
Please note that you should always use cronie on submit.toolserver.org for submitting jobs to SGE via cron. These cron tasks will be executed even if one host (e.g. clematis or willow) is down. This has been the suggested usage for about 17 months. Many people migrated their cron jobs from nightshade to willow during the last weeks, but they will face the same problem again if willow has to be shut down for a longer time (which hopefully never happens).
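A crontab entry on submit.toolserver.org combining cronie with qcronsub might look like this; the schedule, job name, resource values, and script path are assumptions for illustration only:

```shell
# crontab fragment (edit with `crontab -e` on submit.toolserver.org).
# Every day at 02:00, submit the job to SGE via qcronsub with
# explicit runtime and memory requests. All names/paths illustrative.
0 2 * * * qcronsub -N examplejob -l h_rt=4:00:00 -l virtual_free=500M $HOME/bin/examplejob.sh
```

The cron daemon only runs the short submission command; the job itself is then executed wherever SGE places it.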
First of all, thanks a lot for your effort!! At first I wondered why make it even more complicated (which increases the chance of misconfiguration), but after setting up my cron(ie)tab I have to say it makes sense! It will be a good thing, even though it is not simple.
-- Example:
This morning Dr. Trigon complained that his job "mainbot" did not run immediately and was queued for a long time. I would guess he submitted his job from cron using "cronsub mainbot -l /home/drtrigon/pywikipedia/mainbot.py". This indicates that the job runs forever (longrun) with unknown memory usage, so grid engine was only able to start this job on willow. It is not possible to run infinite jobs on the web servers (only shorter jobs are allowed there, so that most jobs have finished before high web server load is expected in the evening). Nor was it possible to run it on the server running the mail transfer agent, which has less than 500 MB of memory free but plenty of CPU power (the expected memory usage was unknown). Other servers like nightshade and yarrow are currently unavailable.
Thanks for taking me as an example - that helps a lot... ;))
The exact command was: cronsub -sl mainbot $HOME/pywikipedia/bot_control.py -default -cron (very close... ;)
According to its last run, this job takes about 2 hours and 30 minutes of runtime and had a peak memory usage of 370 MB. I got these values by asking grid engine for the usage statistics of the last ten days: "qacct -j mainbot -d 10". To make sure the job always gets enough resources, I would suggest raising the values to 4 hours and 500 MB of memory. It is not a problem to request more resources than really needed, but a job needing more resources than requested may be killed. So the new submit command would be:
"qcronsub -N mainbot -l h_rt=4:00:00 -l virtual_free=500MB /home/drtrigon/pywikipedia/mainbot.py"
I now use: qcronsub -l h_rt=12:00:00 -l virtual_free=500M -m a -b y -N mainbot $HOME/pywikipedia/bot_control.py -default -cron (a little bit more time, plus '-m a -b y')
And here my key question arises: you mentioned 'qacct' to get more info (thanks for this hint), and this was one of the biggest problems I had with the whole SGE stuff; I was not able to find complete documentation, neither on the Toolserver wiki nor elsewhere. At the moment, commands like 'qstat' or 'qdel' are not covered there anymore. I (we) would like to know more about this great system.
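For reference, here are a few standard Grid Engine monitoring commands as I understand them; these are generic SGE invocations (the job name "mainbot" is just the example from this thread), not Toolserver-specific documentation:

```shell
# Generic SGE job-monitoring commands (require a grid engine installation).
qstat -u "$USER"        # list your own pending and running jobs
qstat -j mainbot        # show detailed scheduling info for one job
qdel mainbot            # cancel a job by name or job ID
qacct -j mainbot -d 10  # accounting statistics for runs in the last 10 days
```

The qacct output includes wallclock time and maximum memory usage, which is where the h_rt and virtual_free values for resubmission can be read off.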
E.g. what are the analogues of the old commands:
'cronsub [jobname] [command]' has become 'qcronsub -l h_rt=06:00:00 -l virtual_free=100M -m a -b y -N [jobname] [command]'
'cronsub -l [jobname] [command]' has become 'qcronsub -l h_rt=INFINITY -l virtual_free=100M -m a -b y -N [jobname] [command]'
as far as I can see... (I do not remember what the '-s' was for...) Is this correct?
Thanks for your work Merlissimo and greetings DrTrigon