2013/5/17 Merlissimo <merl@toolserver.org>

That is an SGE scheduler problem.

I could not comment on your SGE ticket because JIRA does not accept my JIRA token. The load limit is set correctly, because we use the np_load_* values, which are the host load divided by the number of cores on that host. So, for example, SGE stops scheduling jobs on nightshade if the host load is more than 20. So I do not think increasing this value makes sense.

Your output below contains load adjustments:
queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with nproc=4) >= 3.1
means that there is a normalized host load of 0.015000 on yarrow and that 16 jobs were started within the last 4.5 minutes (= load_adjustment_time). For the first 4.5 minutes of a job's lifetime, SGE temporarily adds some expected load for each new job so that the host does not become overloaded later. Most new jobs need some startup time before they really use all the resources they need. This prevents scheduling too many jobs at once on one execd client.
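As a rough illustration of the arithmetic in that message, here is a minimal Python sketch. The function names are hypothetical and this is not SGE code; the formula (measured normalized load plus 0.8 per recently started job, divided by nproc) is only inferred from the figures in the quoted qstat lines.

```python
def adjusted_np_load(np_load, recent_jobs, nproc, adjustment=0.8):
    """Normalized load plus the temporary load adjustment for new jobs.

    np_load     -- measured normalized load (host load / number of cores)
    recent_jobs -- jobs started within the load adjustment window
    nproc       -- number of cores on the host
    adjustment  -- expected load per new job (0.8 in the quoted output)
    """
    return np_load + adjustment * recent_jobs / nproc


def is_overloaded(np_load, recent_jobs, nproc, threshold, adjustment=0.8):
    """True if the adjusted load reaches the queue's load threshold."""
    return adjusted_np_load(np_load, recent_jobs, nproc, adjustment) >= threshold


# longrun-lx@yarrow: 0.015000 + 0.8 * 16 / 4, compared against 3.1
value = adjusted_np_load(0.015, 16, 4)
print(f"{value:.6f}", is_overloaded(0.015, 16, 4, threshold=3.1))
```

With the yarrow figures, almost the entire value comes from the 16 supposedly new jobs, not from real load on the host.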

But as you can also see, in reality there are no new jobs. The problem shows in this response from the master:

$qping -info damiana 536 qmaster 1
05/17/2013 07:03:14:
SIRM version: 0.1
SIRM message id: 1
start time: 05/15/2013 23:47:49 (1368661669)
run time [s]: 112525
messages in read buffer: 0
messages in write buffer: 0
nr. of connected clients: 8
status: 1
info: MAIN: E (112524.48) | signaler000: E (112523.98) | event_master000: E (0.27) | timer000: E (4.27) | worker000: E (7.05) | worker001: E (8.93) | listener000: E (1.03) | scheduler000: E (8.93) | listener001: E (5.03) | WARNING

All threads are in error state, including the scheduler thread. So the scheduler does not accept the status updates sent by the execds, and therefore it does not know about finished jobs or load updates. That's why the qstat output shows a (non-existent) overload problem and no running jobs (although some old long-running jobs are still running).
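To check the thread states without reading the line by eye, the "info:" field could be split up along these lines. This is only a sketch: the parser and its function name are hypothetical and assume the field layout shown in the qping output above (name, state letter, a number in parentheses, and a trailing overall status word).

```python
def parse_thread_states(info):
    """Map thread name -> (state, number in parentheses) from a qping info field."""
    threads = {}
    for field in info.split("|"):
        field = field.strip()
        if ":" not in field:
            continue  # trailing overall status word such as "WARNING"
        name, rest = field.split(":", 1)
        state, number = rest.split()
        threads[name.strip()] = (state, float(number.strip("()")))
    return threads


# The text after "info: " from the qping output above
info = ("MAIN: E (112524.48) | signaler000: E (112523.98) | "
        "event_master000: E (0.27) | timer000: E (4.27) | "
        "worker000: E (7.05) | worker001: E (8.93) | "
        "listener000: E (1.03) | scheduler000: E (8.93) | "
        "listener001: E (5.03) | WARNING")

threads = parse_thread_states(info)
in_error = sorted(name for name, (state, _) in threads.items() if state == "E")
print(len(threads), "threads,", len(in_error), "in state E")
```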

I think this could be solved by restarting the master scheduler process.
That is why I (as SGE operator) sent a kill command to the scheduler on damiana, hoping that the HA cluster would restart this process/service automatically. Sadly, this is not the case, so we have to wait until a TS admin can restart the service manually.

In the meantime, submitting new jobs will return an error; sorry for that.
Running and queued jobs are not affected and will stay running or queued.

Merlissimo

On 17.05.2013 03:41, Tim Landscheidt wrote:

Hi,

a "qstat -j" of a simple job yields inter alia:

| scheduling info: queue instance "longrun-sol@willow.toolserver.org" dropped because it is temporarily not available
| queue instance "short-sol@willow.toolserver.org" dropped because it is temporarily not available
| queue instance "medium-lx@mayapple.toolserver.org" dropped because it is temporarily not available
| queue instance "longrun3-sol@willow.toolserver.org" dropped because it is temporarily not available
| queue instance "longrun2-sol@clematis.toolserver.org" dropped because it is disabled
| queue instance "longrun2-sol@hawthorn.toolserver.org" dropped because it is disabled
| queue instance "medium-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=0.791601 (= 0.391601 + 0.8 * 2.000000 with nproc=4) >= 0.75
| queue instance "medium-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=1.215000 (= 0.015000 + 0.8 * 6.000000 with nproc=4) >= 1.2
| queue instance "medium-lx@nightshade.toolserver.org" dropped because it is overloaded: np_load_short=1.227500 (= 0.127500 + 0.8 * 11.000000 with nproc=8) >= 1.2
| queue instance "medium-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=0.778613 (= 0.078613 + 0.8 * 7.000000 with nproc=8) >= 0.75
| queue instance "short-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=1.278613 (= 0.078613 + 0.8 * 12.000000 with nproc=8) >= 1.2
| queue instance "short-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.391601 (= 0.391601 + 0.8 * 5.000000 with nproc=4) >= 1.2
| queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with nproc=4) >= 3.1
| queue instance "longrun-lx@nightshade.toolserver.org" dropped because it is overloaded: mem_free=-420765696.524288 (= 14098.726562M - 500M * 29.000000) <= 500

At the moment, we have /no/ jobs scheduled by SGE running.
Meanwhile, the hosts are idling:

| queuename qtype resv/used/tot. load_avg arch states
| ---------------------------------------------------------------------------------
| short-sol@ortelius.toolserver. B 0/0/8 1.52 sol-amd64
| ---------------------------------------------------------------------------------
| short-sol@willow.toolserver.or B 0/0/8 -NA- sol-amd64 au
| ---------------------------------------------------------------------------------
| short-sol@wolfsbane.toolserver B 0/0/12 0.64 sol-amd64
| ---------------------------------------------------------------------------------
| medium-lx@mayapple.toolserver. B 0/0/32 -NA- linux-x64 adu
| ---------------------------------------------------------------------------------
| medium-lx@nightshade.toolserve B 0/0/8 1.05 linux-x64
| ---------------------------------------------------------------------------------
| medium-lx@yarrow.toolserver.or B 0/0/8 0.02 linux-x64
| ---------------------------------------------------------------------------------
| longrun-lx@nightshade.toolserv BI 0/0/64 1.05 linux-x64
| ---------------------------------------------------------------------------------
| longrun-lx@yarrow.toolserver.o BI 0/0/64 0.02 linux-x64
| ---------------------------------------------------------------------------------
| longrun-sol@willow.toolserver. BI 0/0/64 -NA- sol-amd64 au
| ---------------------------------------------------------------------------------
| medium-sol@ortelius.toolserver B 0/0/4 1.52 sol-amd64
| ---------------------------------------------------------------------------------
| medium-sol@wolfsbane.toolserve B 0/0/4 0.64 sol-amd64
| ---------------------------------------------------------------------------------
| longrun2-sol@clematis.toolserv B 0/0/8 0.03 sol-amd64 d
| ---------------------------------------------------------------------------------
| longrun2-sol@hawthorn.toolserv B 0/0/8 0.23 sol-amd64 d
| ---------------------------------------------------------------------------------
| longrun3-sol@willow.toolserver B 0/0/4 -NA- sol-amd64 aduE

I filed https://jira.toolserver.org/browse/TS-1650 on Monday
to no avail so far.

Tim

_______________________________________________
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
