(anonymous) wrote:
That is an SGE scheduler problem.
I could not comment on your SGE ticket because JIRA does not accept my JIRA token. The load limit is set correctly, because we use np_load_* values, which are the host load divided by the number of cores on that host. So, for example, SGE only stops scheduling jobs on nightshade if the host load is more than 20. I therefore think increasing this value does not make sense.
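For reference, the per-queue threshold can be checked with qconf -sq. The queue name and value in this sketch are only taken from the scheduler message quoted further down, not from the live configuration:

$ qconf -sq longrun-lx | grep load_thresholds
load_thresholds       np_load_short=3.1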
You're probably right about this. I was assuming a load limit of 2 just from the symptoms displayed.
Your output below contains load adjustments:

queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with nproc=4) >= 3.1
This means that there is a normalized host load of 0.015000 on yarrow and that 16 jobs were started within the last 4.5 minutes (= load_adjustment_decay_time). SGE temporarily (for the first 4.5 minutes of a job's lifetime) adds some expected load for each new job, so that hosts do not become overloaded shortly afterwards. Most new jobs need some startup time before they really use all the resources they requested. This prevents scheduling too many jobs at once to one execd client.
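To spell out the arithmetic from that message (using only the numbers shown in it):

  adjusted np_load_short = 0.015000 + (0.8 * 16) / 4
                         = 0.015000 + 3.200000
                         = 3.215000  >= threshold 3.1  -> queue instance dropped

The per-job adjustment factor (0.8) and the 4.5 minute window come from the scheduler configuration, which can be shown with qconf -ssconf. Based on the numbers above it presumably contains something like the following; these values are inferred from the message, not checked against the real config:

$ qconf -ssconf | grep -E 'job_load_adjustments|load_adjustment_decay_time'
job_load_adjustments              np_load_short=0.80
load_adjustment_decay_time        0:4:30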
But as you can also see, in reality there are no new jobs. The problem is this response from the master:
$ qping -info damiana 536 qmaster 1
05/17/2013 07:03:14:
SIRM version:             0.1
SIRM message id:          1
start time:               05/15/2013 23:47:49 (1368661669)
run time [s]:             112525
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 8
status:                   1
info:                     MAIN: E (112524.48) | signaler000: E (112523.98) | event_master000: E (0.27) | timer000: E (4.27) | worker000: E (7.05) | worker001: E (8.93) | listener000: E (1.03) | scheduler000: E (8.93) | listener001: E (5.03) | WARNING
All threads are in the error state ("E"), including the scheduler thread. So the scheduler does not accept the status updates sent by the execds, and therefore it does not know about finished jobs or load changes. That is why the qstat output shows a (non-existent) overload problem and no running jobs (although some old long-running jobs are still running).
I think this could be solved by restarting the master scheduler process. That is why I (as SGE operator) sent a kill command to the scheduler on damiana and hoped that the ha_cluster would automatically restart this process/service. Sadly that is not the case, so we have to wait until a TS admin can restart this service manually.
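For reference, a manual restart of the qmaster (which also brings back the scheduler thread) would presumably look roughly like this; the cell name ("default") and script location are assumptions about the installation, not verified:

# on damiana, as the SGE admin user
$ $SGE_ROOT/default/common/sgemaster stop     # or: qconf -km for a clean qmaster shutdown
$ $SGE_ROOT/default/common/sgemaster start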
In the meantime, submitting new jobs will return an error; sorry for that. All running or queued jobs are not affected and will stay running or queued.
[...]
Thanks for tracking this down! Looking at qstat -u *, it seems to have recovered now.
Tim
P.S.: Regarding JIRA, did I miss any follow-up to http://permalink.gmane.org/gmane.org.wikimedia.toolserver/5241?