SGE queue waiting forever?


Dr. Trigon

24 Nov 2012

8:17 a.m.

Hello,

I have an issue with my jobs: they are not being executed, or more precisely, they have been sitting in the queue since midnight:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 786288 0.50539 ircbot     drtrigon     r     11/17/2012 00:18:04 longrun-sol@willow.toolserver.     1
 825178 0.50042 subster_me drtrigon     qw    11/24/2012 00:06:04                                    1
 825207 0.50039 subster_en drtrigon     qw    11/24/2012 01:06:03                                    1
 825212 0.50038 subster_nl drtrigon     qw    11/24/2012 01:36:03                                    1
 825228 0.50037 mainbot    drtrigon     qw    11/24/2012 02:36:03                                    1
 825106 0.50035 subster_ar drtrigon     qw    11/23/2012 21:06:03                                    1
 825177 0.50000 maintenanc drtrigon     qw    11/24/2012 00:06:04                                    1
 825191 0.00000 subster_fr drtrigon     qw    11/24/2012 00:36:04                                    1

...what could be the issue here?

Thanks and greetings
DrTrigon


Wolfgang Faust

24 Nov 2012

9:37 a.m.

Logging in to submit.toolserver.org has been taking a really long time recently (starting a few days ago). Clematis doesn't seem to be under any load, though, so I don't know what's going on.


Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette

-- This message has been encoded in 128ROT13 for security. If you are unable to view it, please consult an optometrist.

Platonides

11:08 a.m.

On 24/11/12 17:37, Wolfgang Faust wrote:

> Logging in to submit.toolserver.org has been taking a really long time recently (starting a few days ago). Clematis doesn't seem to be under any load, though, so I don't know what's going on.

Thousands of processes running df -k /mnt/user-store? :)

$ ssh submit "ps -ef | grep -c 'df -k /mnt/user-store'"
15408

Seems we have /mnt/user-store problems again.

clematis ~ $ time ls /mnt/user-store
NFS server thyme not responding still trying
NFS getattr failed for server thyme: error 16 (RPC: Failed (unspecified error))
ls: cannot access /mnt/user-store: Connection timed out

But those instances are stuck. Some of the processes have been there since Nov 12. It seems that nosy has just killed them.
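A minimal sketch of probing a mount without hanging the calling shell, in the spirit of the `time ls` check above. It assumes GNU `timeout` is installed, which may not be true on every host discussed here, and `check_mount` is a made-up helper name rather than an existing toolserver command. On a hard NFS mount the probe can end up in uninterruptible I/O, so the follow-up KILL requested with `-k` is a best effort and may not free it on every system.

```shell
# check_mount DIR [SECONDS]: hypothetical helper, not an existing toolserver tool.
# Probes DIR with a time limit so a dead NFS server cannot hang the caller.
check_mount() {
    dir=$1
    limit=${2:-5}
    # GNU timeout (assumed available) sends TERM after $limit seconds,
    # then KILL one second later if the probe is still alive.
    if timeout -k 1 "$limit" ls -d "$dir" >/dev/null 2>&1; then
        echo "ok: $dir"
    else
        echo "unresponsive or missing: $dir"
        return 1
    fi
}
```

Called from cron or a login script as `check_mount /mnt/user-store 5`, this would give up after a few seconds instead of leaving another stuck process behind.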

Marlen Caemmerer

12:43 p.m.

Hello,

A broken NFS mount was the source of the slow login. I don't know whether it affected SGE as well, but when I tried to mount the user-store I got the error "Out of stream resources". There might be something fishy with the local disks too, since cat /etc/vfstab took ages twice and ls returned "no such file or directory" twice as well. But the IPMI logs and the RAID utility from Solaris showed no faults. I rebooted, and the system now seems to be running OK. Do you still see any issues?

Cheers
nosy


Merlissimo

1:15 p.m.

On 24.11.2012 20:43, Marlen Caemmerer wrote:

> Hello,
>
> A broken NFS mount was the source of the slow login. I don't know whether it affected SGE as well, but when I tried to mount the user-store I got the error "Out of stream resources". There might be something fishy with the local disks too, since cat /etc/vfstab took ages twice and ls returned "no such file or directory" twice as well. But the IPMI logs and the RAID utility from Solaris showed no faults. I rebooted, and the system now seems to be running OK. Do you still see any issues?
>
> Cheers
> nosy

At 20:32 on Nov 23rd, SGE on turnera stopped and was started on damiana. The qmaster thread started successfully, since it responds to pings and so on. But the scheduler thread does not seem to be working. qconf -tsm does not produce any status information (which would be written to the logs when this command is sent). That's why no new jobs are being sent to the execution clients.

So the switchover on the HA cluster failed.

Merlissimo

@All: If you are working on big files, please copy them to local temp first (on SGE, $TMP contains an individual temp dir for the job). E.g. piping big files to other slow programs causes a lot of NFS load, because the data must be read in small packets, which causes high load on the servers. That's why SGE has been unable to schedule new jobs on nightshade for days.
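The advice about local temp can be sketched as a small job-script helper. This is only an illustration under stated assumptions: `stage_and_run` is a hypothetical name, `$TMP` is taken to be the per-job directory mentioned above, and `/tmp` is a fallback picked here for runs outside SGE.

```shell
# stage_and_run INPUT COMMAND [ARGS...]: hypothetical helper for a job script.
# Copies INPUT from NFS to node-local temp once, then feeds the local copy
# to COMMAND on stdin, so a slow consumer never reads from NFS directly.
stage_and_run() {
    input=$1
    shift
    # Per the advice above, $TMP is a per-job directory under SGE;
    # /tmp is an assumed fallback for manual runs.
    workdir=${TMP:-/tmp}
    local_copy=$(mktemp "$workdir/staged.XXXXXX") || return 1
    if cp "$input" "$local_copy"; then
        "$@" < "$local_copy"
        status=$?
    else
        status=1
    fi
    # Clean up the local copy whether or not the command succeeded.
    rm -f "$local_copy"
    return "$status"
}
```

In a job script this might be called as `stage_and_run /mnt/user-store/dump.xml ./slow_parser` (both the file and the program name are made up). The point of the design is that NFS sees one sequential copy instead of many small reads paced by the slow program.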

Dr. Trigon

1:38 p.m.

On 24.11.2012 21:15, Merlissimo wrote:

> At 20:32 on Nov 23rd, SGE on turnera stopped and was started on damiana. The qmaster thread started successfully, since it responds to pings and so on. But the scheduler thread does not seem to be working. qconf -tsm does not produce any status information (which would be written to the logs when this command is sent). That's why no new jobs are being sent to the execution clients.
>
> So the switchover on the HA cluster failed.

...so is it supposed to be working now...?

> @All: If you are working on big files, please copy them to local temp first (on SGE, $TMP contains an individual temp dir for the job). E.g. piping big files to other slow programs causes a lot of NFS load, because the data must be read in small packets, which causes high load on the servers. That's why SGE has been unable to schedule new jobs on nightshade for days.

What counts as a big file? Is it OK if the file is in user-home?

Thanks and greetings
DrTrigon

Platonides

2:10 p.m.

On 24/11/12 21:38, Dr. Trigon wrote:

>> @All: If you are working on big files, please copy them to local temp first (on SGE, $TMP contains an individual temp dir for the job). E.g. piping big files to other slow programs causes a lot of NFS load, because the data must be read in small packets, which causes high load on the servers. That's why SGE has been unable to schedule new jobs on nightshade for days.
>
> What counts as a big file? Is it OK if the file is in user-home?
>
> Thanks and greetings
> DrTrigon

/home is also mounted over NFS.

It's strange, though, that reading from big files overloads the servers. stdio, or the equivalent functionality in whatever language the programs are written in, should already be reading in blocks.

Looking at willow's mounts, /shared and /home are mounted with NFSv3 over UDP. But /mnt/user-store and /install don't show those options, so they are probably using NFSv4 over TCP. Is that intended?
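A comparison like this can be made less eyeball-driven by pulling the version and transport out of each mount's option string. The following is a rough sketch that assumes Linux-style `/proc/mounts` lines (device, mount point, type, options). At least some of the hosts in this thread appear to be Solaris, where `nfsstat -m` would be the place to look instead, so both the input format and the name `nfs_mount_summary` are assumptions.

```shell
# nfs_mount_summary: hypothetical helper; reads /proc/mounts-style lines on stdin
# and prints the NFS version and transport for each NFS mount point.
# A "?" means the option was not listed, as with the mounts that "don't show it".
nfs_mount_summary() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        vers = "?"; proto = "?"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] ~ /^vers=/)  vers  = substr(opts[i], 6)
            if (opts[i] ~ /^proto=/) proto = substr(opts[i], 7)
        }
        print $2, "vers=" vers, "proto=" proto
    }'
}
```

On a Linux host it would be run as `nfs_mount_summary < /proc/mounts`.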

Dr. Trigon

25 Nov 2012

3:33 a.m.

Today it seems to be working and fully functional again... Nice job! ;)

Thanks to all involved here!

Greetings and have a nice weekend
DrTrigon

Marlen Caemmerer

3:03 p.m.

Hello,

Nice to hear. I had a look from all sides, and it seemed the SGE master thought the queues on the hosts were full. When I looked this morning I saw only willow running some jobs, with ortelius still in this strange state. I waited for Merl to advise me, probably on reconfiguring, but it was too early this morning, so I simply deleted some jobs (5, I think) from the queues that had been submitted on 9th Oct, when user-store failed.

I figured the users were probably no longer waiting for those jobs anyway, and hoped the queues would regenerate once they were modified. Luckily for me, this seems to have solved the problem ;) since according to the logs the jobs were running fine afterwards.

Cheers
nosy


Dr. Trigon

26 Nov 2012

11:24 a.m.

On 25.11.2012 23:03, Marlen Caemmerer wrote:

> Hello,
>
> Nice to hear. I had a look from all sides, and it seemed the SGE master thought the queues on the hosts were full. When I looked this morning I saw only willow running some jobs, with ortelius still in this strange state. I waited for Merl to advise me, probably on reconfiguring, but it was too early this morning, so I simply deleted some jobs (5, I think) from the queues that had been submitted on 9th Oct, when user-store failed.
>
> I figured the users were probably no longer waiting for those jobs anyway, and hoped the queues would regenerate once they were modified. Luckily for me, this seems to have solved the problem ;) since according to the logs the jobs were running fine afterwards.

Cool! ...but maybe some further advice from Merl would still be a good thing... :)

Greetings
DrTrigon
