Hi,
a "qstat -j" of a simple job yields inter alia:
| scheduling info: queue instance "longrun-sol@willow.toolserver.org" dropped because it is temporarily not available
|                  queue instance "short-sol@willow.toolserver.org" dropped because it is temporarily not available
|                  queue instance "medium-lx@mayapple.toolserver.org" dropped because it is temporarily not available
|                  queue instance "longrun3-sol@willow.toolserver.org" dropped because it is temporarily not available
|                  queue instance "longrun2-sol@clematis.toolserver.org" dropped because it is disabled
|                  queue instance "longrun2-sol@hawthorn.toolserver.org" dropped because it is disabled
|                  queue instance "medium-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=0.791601 (= 0.391601 + 0.8 * 2.000000 with nproc=4) >= 0.75
|                  queue instance "medium-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=1.215000 (= 0.015000 + 0.8 * 6.000000 with nproc=4) >= 1.2
|                  queue instance "medium-lx@nightshade.toolserver.org" dropped because it is overloaded: np_load_short=1.227500 (= 0.127500 + 0.8 * 11.000000 with nproc=8) >= 1.2
|                  queue instance "medium-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=0.778613 (= 0.078613 + 0.8 * 7.000000 with nproc=8) >= 0.75
|                  queue instance "short-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=1.278613 (= 0.078613 + 0.8 * 12.000000 with nproc=8) >= 1.2
|                  queue instance "short-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.391601 (= 0.391601 + 0.8 * 5.000000 with nproc=4) >= 1.2
|                  queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with nproc=4) >= 3.1
|                  queue instance "longrun-lx@nightshade.toolserver.org" dropped because it is overloaded: mem_free=-420765696.524288 (= 14098.726562M - 500M * 29.000000) <= 500
At the moment, we have /no/ jobs scheduled by SGE running. Meanwhile, the hosts are idling:
| queuename                      qtype resv/used/tot. load_avg arch       states
| ---------------------------------------------------------------------------------
| short-sol@ortelius.toolserver. B     0/0/8          1.52     sol-amd64
| ---------------------------------------------------------------------------------
| short-sol@willow.toolserver.or B     0/0/8          -NA-     sol-amd64  au
| ---------------------------------------------------------------------------------
| short-sol@wolfsbane.toolserver B     0/0/12         0.64     sol-amd64
| ---------------------------------------------------------------------------------
| medium-lx@mayapple.toolserver. B     0/0/32         -NA-     linux-x64  adu
| ---------------------------------------------------------------------------------
| medium-lx@nightshade.toolserve B     0/0/8          1.05     linux-x64
| ---------------------------------------------------------------------------------
| medium-lx@yarrow.toolserver.or B     0/0/8          0.02     linux-x64
| ---------------------------------------------------------------------------------
| longrun-lx@nightshade.toolserv BI    0/0/64         1.05     linux-x64
| ---------------------------------------------------------------------------------
| longrun-lx@yarrow.toolserver.o BI    0/0/64         0.02     linux-x64
| ---------------------------------------------------------------------------------
| longrun-sol@willow.toolserver. BI    0/0/64         -NA-     sol-amd64  au
| ---------------------------------------------------------------------------------
| medium-sol@ortelius.toolserver B     0/0/4          1.52     sol-amd64
| ---------------------------------------------------------------------------------
| medium-sol@wolfsbane.toolserve B     0/0/4          0.64     sol-amd64
| ---------------------------------------------------------------------------------
| longrun2-sol@clematis.toolserv B     0/0/8          0.03     sol-amd64  d
| ---------------------------------------------------------------------------------
| longrun2-sol@hawthorn.toolserv B     0/0/8          0.23     sol-amd64  d
| ---------------------------------------------------------------------------------
| longrun3-sol@willow.toolserver B     0/0/4          -NA-     sol-amd64  aduE
I filed https://jira.toolserver.org/browse/TS-1650 on Monday to no avail so far.
Tim
That is an SGE scheduler problem.
I could not comment on your SGE ticket because JIRA does not accept my JIRA token. The load limit is set correctly, because we use the np_load_* values, which are the host load divided by the number of cores on that host. So, for example, SGE stops scheduling jobs on nightshade if the host load is more than 20. So I think increasing this value does not make sense.
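For example (made-up numbers, just to illustrate the normalization):

  load average 6.0 on a host with nproc=8  ->  np_load = 6.0 / 8 = 0.75
  load average 6.0 on a host with nproc=4  ->  np_load = 6.0 / 4 = 1.50

So the same np_load threshold lets a host with more cores carry proportionally more raw load before it counts as overloaded.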
Your output below contains load adjustments: queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with nproc=4) >= 3.1 means that there is a normalized host load of 0.015000 on yarrow and 16 jobs were started within the last 4.5 minutes (= load_adjustment_time). SGE temporarily (for the first 4.5 minutes of a job's lifetime) adds some expected load for each new job so that the host does not become overloaded later. Most new jobs normally need some start-up time before they really use all the resources they need. This prevents scheduling too many jobs at once onto one execd client.
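Spelled out with the yarrow numbers (the division by nproc is not written out in the printed formula, but it is what makes the figures add up):

  np_load_short = host np_load + per-job adjustment * recently started jobs / nproc
                = 0.015 + 0.8 * 16 / 4
                = 0.015 + 3.2
                = 3.215, which is >= the longrun-lx threshold of 3.1, so the queue instance is dropped.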
But as you can also see, in reality there are no new jobs. The problem is the response from the master:
$ qping -info damiana 536 qmaster 1
05/17/2013 07:03:14:
SIRM version:             0.1
SIRM message id:          1
start time:               05/15/2013 23:47:49 (1368661669)
run time [s]:             112525
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 8
status:                   1
info:                     MAIN: E (112524.48) | signaler000: E (112523.98) | event_master000: E (0.27) | timer000: E (4.27) | worker000: E (7.05) | worker001: E (8.93) | listener000: E (1.03) | scheduler000: E (8.93) | listener001: E (5.03) | WARNING
All threads are in an error state, including the scheduler thread. So the scheduler does not accept the status updates sent by the execds, and therefore it does not know about finished jobs and load updates. That is why the qstat output shows a (non-existent) overload problem and no running jobs (although some old long-running jobs are still running).
I think this could be solved by restarting the master scheduler process. That is why I (as SGE operator) sent a kill command to the scheduler on damiana and hoped that the HA cluster would restart this process/service automatically. Sadly, that is not the case, so we have to wait until a TS admin can restart the service manually.
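For reference, on a plain (non-HA) SGE installation the manual restart would look roughly like this; the paths assume a stock install ($SGE_ROOT and the "default" cell name are assumptions), and on damiana the HA cluster framework manages the service, so a TS admin has to do the equivalent through it:

  # run as root on the qmaster host; paths assume a default install
  $SGE_ROOT/default/common/sgemaster stop    # stops sge_qmaster, including its scheduler thread
  $SGE_ROOT/default/common/sgemaster start   # starts it again; the execds reconnect on their own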
In the meantime, submitting new jobs will return an error; sorry for that. All running or queued jobs are not affected and will keep running or stay queued.
Merlissimo
I too can't run, as Alebot, a very simple IRC script (which needs longrun); qstat shows that it stays in qw status, and qstat -j says something esoteric mentioning "overload".
I'll follow this thread to see if the issue will be solved.
Alex
Merlissimo wrote:
That is an SGE scheduler problem.
I could not comment on your SGE ticket because JIRA does not accept my JIRA token. The load limit is set correctly, because we use the np_load_* values, which are the host load divided by the number of cores on that host. So, for example, SGE stops scheduling jobs on nightshade if the host load is more than 20. So I think increasing this value does not make sense.
You're probably right about this. I was assuming the load goal of 2 just from the symptoms displayed.
[...]
Thanks for tracking this down! Looking at qstat -u *, it seems to have recovered now.
Tim
P. S.: Regarding JIRA, did I miss any followup to http://permalink.gmane.org/gmane.org.wikimedia.toolserver/5241?