(anonymous) wrote:
That is an SGE scheduler problem.
I could not comment on your SGE ticket because JIRA does not accept my JIRA token. The load limit is set correctly, because we use np_load_* values, which are the host load divided by the number of cores on that host. So, for example, SGE only stops scheduling jobs on nightshade if the host load is more than 20. I therefore think increasing this value does not make sense.
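For reference, the per-queue threshold can be checked with qconf -sq. The queue name and value in this sketch are only taken from the scheduler message quoted further down, not from the live configuration:

$ qconf -sq longrun-lx | grep load_thresholds
load_thresholds       np_load_short=3.1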
You're probably right about this. I was assuming a load limit of 2 just from the symptoms displayed.
Your output below contains load adjustments:

queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with nproc=4) >= 3.1
This means that there is a normalized host load of 0.015000 on yarrow and that 16 jobs were started within the last 4.5 minutes (= load_adjustment_decay_time). SGE temporarily (for the first 4.5 minutes of a job's lifetime) adds some expected load for each new job, so that hosts do not become overloaded shortly afterwards. Most new jobs need some startup time before they really use all the resources they requested. This prevents scheduling too many jobs at once to one execd client.
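To spell out the arithmetic from that message (using only the numbers shown in it):

  adjusted np_load_short = 0.015000 + (0.8 * 16) / 4
                         = 0.015000 + 3.200000
                         = 3.215000  >= threshold 3.1  -> queue instance dropped

The per-job adjustment factor (0.8) and the 4.5 minute window come from the scheduler configuration, which can be shown with qconf -ssconf. Based on the numbers above it presumably contains something like the following; these values are inferred from the message, not checked against the real config:

$ qconf -ssconf | grep -E 'job_load_adjustments|load_adjustment_decay_time'
job_load_adjustments              np_load_short=0.80
load_adjustment_decay_time        0:4:30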
But as you can also see, in reality there are no new jobs. The problem is this response from the master:
$ qping -info damiana 536 qmaster 1
05/17/2013 07:03:14:
SIRM version:             0.1
SIRM message id:          1
start time:               05/15/2013 23:47:49 (1368661669)
run time [s]:             112525
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 8
status:                   1
info:                     MAIN: E (112524.48) | signaler000: E (112523.98) | event_master000: E (0.27) | timer000: E (4.27) | worker000: E (7.05) | worker001: E (8.93) | listener000: E (1.03) | scheduler000: E (8.93) | listener001: E (5.03) | WARNING
All threads are in the error state ("E"), including the scheduler thread. So the scheduler does not accept the status updates sent by the execds, and therefore it does not know about finished jobs or load changes. That is why the qstat output shows a (non-existent) overload problem and no running jobs (although some old long-running jobs are still running).
I think this could be solved by restarting the master scheduler process. That is why I (as SGE operator) sent a kill command to the scheduler on damiana and hoped that the ha_cluster would automatically restart this process/service. Sadly that is not the case, so we have to wait until a TS admin can restart this service manually.
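For reference, a manual restart of the qmaster (which also brings back the scheduler thread) would presumably look roughly like this; the cell name ("default") and script location are assumptions about the installation, not verified:

# on damiana, as the SGE admin user
$ $SGE_ROOT/default/common/sgemaster stop     # or: qconf -km for a clean qmaster shutdown
$ $SGE_ROOT/default/common/sgemaster start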
In the meantime, submitting new jobs will return an error; sorry for that. All running or queued jobs are not affected and will stay running or queued.
[...]
Thanks for tracking this down! Looking at qstat -u *, it seems to have recovered now.
Tim
P.S.: Regarding JIRA, did I miss any follow-up to http://permalink.gmane.org/gmane.org.wikimedia.toolserver/5241?