Re: [Toolserver-l] sge scheduler problem (was: SGE thinks hosts are overloaded while the latter are idling)

17 May 2013

I too can't run, as Alebot, a very simple IRC script (that needs longrun);
qstat states that it remains in qw status, and qstat -j reports something
esoteric mentioning "overload".
I'll follow this thread to see whether the issue gets solved.
Alex
2013/5/17 Merlissimo merl@toolserver.org
...
That is a sge scheduler problem.
I could not comment on your sge ticket because jira does not accept my jira
token. The load limit is set correctly, because we use np_load_* values,
which are the load divided by the number of cores on the host. So, e.g., sge
stops scheduling jobs on nightshade if the host load is more than 20. So I
think increasing this value does not make sense.
Your output below contains load adjustments:
  queue instance "longrun-lx@yarrow.toolserver.org"
dropped because it is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 *
16.000000 with nproc=4) >= 3.1

This means that there is a normalized host load of 0.015000 on yarrow and
that 16 jobs were started within the last 4.5 minutes
(= load_adjustment_time). For the first 4.5 minutes of a job's lifetime, sge
temporarily adds some expected load for each new job, so that the host does
not become overloaded later on. Most new jobs need some startup time before
they really use all the resources they need. This prevents scheduling too
many jobs at once onto one execd client.
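The arithmetic in those qstat -j lines can be reproduced with a small
sketch. The formula below is only inferred from the numbers quoted in this
thread (reported load + 0.8 * recently started jobs / nproc); the function
names are made up for illustration and are not part of sge:

```python
def adjusted_np_load(np_load, recent_jobs, nproc, adjustment=0.8):
    """Normalized load plus the temporary per-job load adjustment.

    Mirrors the arithmetic shown in the qstat -j output:
    np_load + adjustment * recent_jobs / nproc
    """
    return np_load + adjustment * recent_jobs / nproc


def is_overloaded(np_load, recent_jobs, nproc, threshold, adjustment=0.8):
    """True if the adjusted load reaches the queue's load threshold."""
    return adjusted_np_load(np_load, recent_jobs, nproc, adjustment) >= threshold


# longrun-lx@yarrow, values taken from the qstat -j output
print(round(adjusted_np_load(0.015, 16, 4), 6))     # 3.215
print(is_overloaded(0.015, 16, 4, threshold=3.1))   # True

# the same host if the scheduler knew that no jobs were started recently
print(is_overloaded(0.015, 0, 4, threshold=3.1))    # False
```

With zero recently started jobs the same host is far below its threshold,
which is why a stale count of "new" jobs alone is enough to make an idle
host look overloaded.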
But as you can also see, in reality there are no new jobs. The problem shows
up in the response from the master:
$qping -info damiana 536 qmaster 1
05/17/2013 07:03:14:
SIRM version:             0.1
SIRM message id:          1
start time:               05/15/2013 23:47:49 (1368661669)
run time [s]:             112525
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 8
status:                   1
info:                     MAIN: E (112524.48) | signaler000: E (112523.98)
| event_master000: E (0.27) | timer000: E (4.27) | worker000: E (7.05) |
worker001: E (8.93) | listener000: E (1.03) | scheduler000: E (8.93) |
listener001: E (5.03) | WARNING
All threads are in error state, including the scheduler thread. So the
scheduler does not accept the status updates sent by the execds, and
therefore it does not know about finished jobs and load updates. That's why
the qstat output shows a (non-existent) overload problem and no running jobs
(although some old long-running jobs are still running).
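To check this without reading the info line by eye, here is a rough sketch
that splits it into thread name, state letter and time. The regular
expression only assumes the layout seen in the qping output above, and
reading "E" as "error" follows the explanation in this mail; the helper name
is hypothetical:

```python
import re

# Matches entries such as "scheduler000: E (8.93)"
THREAD_RE = re.compile(r"(\w+): ([A-Z]) \(([\d.]+)\)")


def parse_thread_states(info):
    """Return {thread_name: (state_letter, seconds)} from a qping info string."""
    return {name: (state, float(secs))
            for name, state, secs in THREAD_RE.findall(info)}


# info line copied from the qping output, with the line wraps removed
info = ("MAIN: E (112524.48) | signaler000: E (112523.98) | "
        "event_master000: E (0.27) | timer000: E (4.27) | "
        "worker000: E (7.05) | worker001: E (8.93) | "
        "listener000: E (1.03) | scheduler000: E (8.93) | "
        "listener001: E (5.03) | WARNING")

states = parse_thread_states(info)
in_error = sorted(name for name, (state, _) in states.items() if state == "E")

print(len(states), "threads,", len(in_error), "in state E")
print("scheduler000" in in_error)
```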
I think this could be solved by restarting the master scheduler process.
That is why I (as sge operator) sent a kill command to the scheduler on
damiana and hoped that the ha_cluster would automatically restart this
process/service. But sadly this is not the case. So we have to wait until a
ts admin can restart this service manually.
In the meantime, submitting new jobs will return an error, sorry for that.
All running or queued jobs are not affected and will keep running or stay
queued.
Merlissimo
On 17.05.2013 03:41, Tim Landscheidt wrote:
...
Hi,
a "qstat -j" of a simple job yields inter alia:
| scheduling info:            queue instance
"longrun-sol@willow.toolserver.org" dropped because it
is temporarily not available
|                             queue instance
"short-sol@willow.toolserver.org" dropped because it
is temporarily not available
|                             queue instance
"medium-lx@mayapple.toolserver.org" dropped because it
is temporarily not available
|                             queue instance
"longrun3-sol@willow.toolserver.org" dropped because it
is temporarily not available
|                             queue instance
"longrun2-sol@clematis.toolserver.org" dropped because
it is disabled
|                             queue instance
"longrun2-sol@hawthorn.toolserver.org" dropped because
it is disabled
|                             queue instance
"medium-sol@ortelius.toolserver.org" dropped because it
is overloaded: np_load_short=0.791601 (= 0.391601 + 0.8 * 2.000000 with
nproc=4) >= 0.75
|                             queue instance
"medium-lx@yarrow.toolserver.org" dropped because it
is overloaded: np_load_short=1.215000 (= 0.015000 + 0.8 * 6.000000 with
nproc=4) >= 1.2
|                             queue instance
"medium-lx@nightshade.toolserver.org" dropped because it
is overloaded: np_load_short=1.227500 (= 0.127500 + 0.8 * 11.000000 with
nproc=8) >= 1.2
|                             queue instance
"medium-sol@wolfsbane.toolserver.org" dropped because it
is overloaded: np_load_short=0.778613 (= 0.078613 + 0.8 * 7.000000 with
nproc=8) >= 0.75
|                             queue instance
"short-sol@wolfsbane.toolserver.org" dropped because it
is overloaded: np_load_short=1.278613 (= 0.078613 + 0.8 * 12.000000 with
nproc=8) >= 1.2
|                             queue instance
"short-sol@ortelius.toolserver.org" dropped because it
is overloaded: np_load_short=1.391601 (= 0.391601 + 0.8 * 5.000000 with
nproc=4) >= 1.2
|                             queue instance
"longrun-lx@yarrow.toolserver.org" dropped because it
is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with
nproc=4) >= 3.1
|                             queue instance
"longrun-lx@nightshade.toolserver.org" dropped because
it is overloaded: mem_free=-420765696.524288 (= 14098.726562M - 500M *
29.000000) <= 500
At the moment, we have /no/ jobs scheduled by SGE running.
Meanwhile, the hosts are idling:
| queuename                      qtype resv/used/tot. load_avg arch       states
| ------------------------------------------------------------------------------
| short-sol@ortelius.toolserver. B     0/0/8          1.52     sol-amd64
| ------------------------------------------------------------------------------
| short-sol@willow.toolserver.or B     0/0/8          -NA-     sol-amd64  au
| ------------------------------------------------------------------------------
| short-sol@wolfsbane.toolserver B     0/0/12         0.64     sol-amd64
| ------------------------------------------------------------------------------
| medium-lx@mayapple.toolserver. B     0/0/32         -NA-     linux-x64  adu
| ------------------------------------------------------------------------------
| medium-lx@nightshade.toolserve B     0/0/8          1.05     linux-x64
| ------------------------------------------------------------------------------
| medium-lx@yarrow.toolserver.or B     0/0/8          0.02     linux-x64
| ------------------------------------------------------------------------------
| longrun-lx@nightshade.toolserv BI    0/0/64         1.05     linux-x64
| ------------------------------------------------------------------------------
| longrun-lx@yarrow.toolserver.o BI    0/0/64         0.02     linux-x64
| ------------------------------------------------------------------------------
| longrun-sol@willow.toolserver. BI    0/0/64         -NA-     sol-amd64  au
| ------------------------------------------------------------------------------
| medium-sol@ortelius.toolserver B     0/0/4          1.52     sol-amd64
| ------------------------------------------------------------------------------
| medium-sol@wolfsbane.toolserve B     0/0/4          0.64     sol-amd64
| ------------------------------------------------------------------------------
| longrun2-sol@clematis.toolserv B     0/0/8          0.03     sol-amd64  d
| ------------------------------------------------------------------------------
| longrun2-sol@hawthorn.toolserv B     0/0/8          0.23     sol-amd64  d
| ------------------------------------------------------------------------------
| longrun3-sol@willow.toolserver B     0/0/4          -NA-     sol-amd64  aduE
I filed https://jira.toolserver.org/browse/TS-1650 on Monday
to no avail so far.
Tim
_______________________________________________
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list:
https://wiki.toolserver.org/view/Mailing_list_etiquette
