I've noticed that one of SuggestBot's hourly jobs has stalled for the past 7 hours, stuck in the "qw" state. Usually it runs like clockwork. Is there a problem with the SGE queues?
Regards, Morten
Hello,
I can't really see any trouble with SGE. Could you please tell me which user runs which command on which host, so I can take a closer look?
Regards, nosy
On Mon, 3 Dec 2012, Morten Wang wrote:
Date: Tue, 4 Dec 2012 05:50:36
From: Morten Wang nettrom@gmail.com
Reply-To: Wikimedia Toolserver toolserver-l@lists.wikimedia.org
To: Wikimedia Toolserver toolserver-l@lists.wikimedia.org
Subject: [Toolserver-l] SGE queues stalled
I've noticed that one of SuggestBot's hourly jobs has stalled for the past 7 hours, stuck in the "qw" state. Usually it runs like clockwork. Is there a problem with the SGE queues?
Regards, Morten
Looks like the issue got resolved around 09:00 UTC, judging from the qacct output:
  jobname      opentasks
  jobnumber    873860
  [...]
  qsub_time    Mon Dec 3 22:19:03 2012
  start_time   Tue Dec 4 09:06:32 2012
  end_time     Tue Dec 4 09:21:18 2012
If you want to look into it more closely, this job was submitted by me (user: nettrom) through my crontab on the submit servers.
Cheers, Morten
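For anyone who wants to measure how long a job sat in the queue, here is a minimal Python sketch that reads the qsub_time and start_time fields from qacct output like the excerpt above. The function names (parse_qacct, queue_wait) are made up for this illustration, and the timestamp format is an assumption based on the lines quoted in this thread; it also assumes English day and month names.

```python
from datetime import datetime

# Assumed timestamp layout, taken from the qacct excerpt quoted in this thread.
QACCT_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

# Sample input copied from the thread (the elided "[...]" part is left out).
SAMPLE = """\
jobname opentasks
jobnumber 873860
qsub_time Mon Dec 3 22:19:03 2012
start_time Tue Dec 4 09:06:32 2012
end_time Tue Dec 4 09:21:18 2012
"""


def parse_qacct(text):
    """Split each 'key value' line into a dict entry; skip lines without a value."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def queue_wait(fields):
    """Return the time between submission and start as a timedelta."""
    submitted = datetime.strptime(fields["qsub_time"], QACCT_TIME_FORMAT)
    started = datetime.strptime(fields["start_time"], QACCT_TIME_FORMAT)
    return started - submitted


if __name__ == "__main__":
    print(queue_wait(parse_qacct(SAMPLE)))  # prints 10:47:29
```

For this job the wait comes out at a bit under eleven hours, while the run itself (start_time to end_time) took under fifteen minutes.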
On 4 December 2012 03:39, Marlen Caemmerer marlen.caemmerer@wikimedia.de wrote:
Hello,
I can't really see any trouble with SGE. Could you please tell me which user runs which command on which host, so I can take a closer look?
Regards, nosy
Server sql-s1-rr was unavailable during the night, so the resource sql-s1-rr was 0.
Because I am not a TS admin, I could not check whether you requested this resource for this job. But nosy just had a look and confirmed my suspicion: the job was started once the resource sql-s1-rr was available again.
Merlissimo
On 04.12.2012 16:44, Morten Wang wrote:
Looks like the issue got resolved around 09:00 UTC, judging from the qacct output:
  jobname      opentasks
  jobnumber    873860
  [...]
  qsub_time    Mon Dec 3 22:19:03 2012
  start_time   Tue Dec 4 09:06:32 2012
  end_time     Tue Dec 4 09:21:18 2012
If you want to look into it more closely, this job was submitted by me (user: nettrom) through my crontab on the submit servers.
Cheers, Morten
Ah, I didn't think of that; of course, that's the obvious explanation. Thanks for looking into it!
Is there a way for me to find that out myself, e.g. using qstat? I had a look at the qstat man page, but judging by the descriptions, it looks like something I'd have to experiment with the next time a job sits in the queue for a long time.
Regards, Morten
On 5 December 2012 07:11, Merlissimo merl@toolserver.org wrote:
Server sql-s1-rr was unavailable during the night, so the resource sql-s1-rr was 0.
Because I am not a TS admin, I could not check whether you requested this resource for this job. But nosy just had a look and confirmed my suspicion: the job was started once the resource sql-s1-rr was available again.
Merlissimo
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
On 05.12.2012 16:21, Morten Wang wrote:
Is there a way for me to find that out myself, e.g. using qstat? I had a look at the qstat man page, but judging by the descriptions, it looks like something I'd have to experiment with the next time a job sits in the queue for a long time.
qstat -j <jobnumber>
lists a scheduling info section.
Example: qstat -j 799111
scheduling info:
  queue instance "short-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.252930 (= 1.252930 + 0.8 * 0.000000 with nproc=4) >= 1.1
  queue instance "longrun-sol@willow.toolserver.org" dropped because it is overloaded: np_load_short=2.528320 (= 2.528320 + 0.8 * 0.000000 with nproc=8) >= 2.0
  queue instance "medium-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.252930 (= 1.252930 + 0.8 * 0.000000 with nproc=4) >= 0.8
  queue instance "longrun2-sol@clematis.toolserver.org" dropped because it is disabled
  queue instance "longrun2-sol@hawthorn.toolserver.org" dropped because it is disabled
  (-l h_rt=57600,mem_free=890M,sql=1,sql-s7-rr=3,sqlprocs-s7=3,tmp_free=20M,user_slot=2,virtual_free=890M) cannot run globally because it offers only gc:sql-s7-rr=0.000000
As you can see, the job cannot run on clematis and hawthorn because these queues are disabled. The queues on willow and ortelius have a temporarily high load. wolfsbane, nightshade and yarrow are missing from this list, so the bot could start on these servers. But the last line, "cannot run globally because it offers only gc:sql-s7-rr=0.000000", shows that the resource sql-s7-rr is not available on any server at the moment. That's why the job stays queued until the s7 database is usable again.
Merlissimo
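If reading the scheduling info by eye gets tedious, the same check can be sketched in Python. This is only a rough illustration: it assumes the message wording is exactly as in the example output above ("dropped because it is ..." and "cannot run globally because it offers only gc:..."), and the function name summarize_scheduling_info is invented here. Other Grid Engine versions or other kinds of messages are not handled.

```python
import re

# Patterns based solely on the message wording in the example output above.
QUEUE_RE = re.compile(r'queue instance "([^"]+)" dropped because it is (\w+)')
GLOBAL_RE = re.compile(
    r"cannot run globally because it offers only gc:([\w-]+)=(\d+(?:\.\d+)?)"
)

# Shortened sample taken from the example in this thread.
SAMPLE = """\
scheduling info:
queue instance "short-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.252930 (= 1.252930 + 0.8 * 0.000000 with nproc=4) >= 1.1
queue instance "longrun2-sol@clematis.toolserver.org" dropped because it is disabled
(-l h_rt=57600,sql=1,sql-s7-rr=3) cannot run globally because it offers only gc:sql-s7-rr=0.000000
"""


def summarize_scheduling_info(text):
    """Return (dropped queues -> reason, resources currently offered as zero)."""
    dropped = {m.group(1): m.group(2) for m in QUEUE_RE.finditer(text)}
    exhausted = [
        name for name, value in GLOBAL_RE.findall(text) if float(value) == 0.0
    ]
    return dropped, exhausted


if __name__ == "__main__":
    dropped, exhausted = summarize_scheduling_info(SAMPLE)
    for queue, reason in dropped.items():
        print(f"{queue}: {reason}")
    for resource in exhausted:
        print(f"global resource unavailable: {resource}")
```

A resource listed as unavailable here is the case that keeps a job in "qw" no matter how many queues are free, which is what happened with sql-s7-rr in the example.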