On 24.11.2012 20:43, Marlen Caemmerer wrote:
Hello,
a broken NFS mount was the cause of the slow login. I don't know whether it affected SGE as well, but when I tried to mount the user-store I got the error "Out of stream resources". There might be something wrong with the local disks too, since cat /etc/vfstab took ages twice and ls returned "no such file or directory" twice as well. However, the IPMI logs and the Solaris RAID utility showed no faults. I rebooted, and the system now seems to be running OK. Do you still see any issues?
Cheers, nosy
At 20:32 on Nov 23rd, SGE on turnera stopped and was started on damiana. The qmaster thread started successfully, since it responds to pings and so on. But the scheduler thread does not seem to be working: qconf -tsm does not produce any status information (which would be written to the logs when I send this command). That's why no new jobs are being sent to the execution clients.
So the switchover on the HA cluster failed.
Merlissimo
@All: If you are working on big files, please copy them to local temp first (on SGE, $TMP contains an individual temp dir for the job). For example, piping big files to other slow programs causes a lot of NFS load, because the data must be read in small packets, which puts high load on the servers. That's why SGE has not been able to schedule new jobs on nightshade for days.
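
A minimal sketch of the "copy to local temp first" advice, for use inside a job script. The helper name stage_in, the fallback to /tmp outside a job, and all paths and program names in the usage comments are assumptions for illustration, not part of the cluster setup:

```shell
#!/bin/sh
# Hypothetical helper: copy a large input file from NFS into the
# job-local temp dir in one sequential read, then print the local path.
stage_in() {
    src=$1
    # $TMP is the per-job temp dir on SGE; /tmp is an assumed fallback
    # so the function also works outside a job.
    dest="${TMP:-/tmp}/$(basename "$src")"
    cp "$src" "$dest" || return 1
    printf '%s\n' "$dest"
}

# Usage inside a job script (placeholder paths and program name):
#   local_copy=$(stage_in /path/on/nfs/big_input.dat) || exit 1
#   slow_program < "$local_copy" > "$TMP/result.out"
#   cp "$TMP/result.out" /path/on/nfs/
```

This way the slow program reads from the local disk, and the NFS server only sees one bulk copy in and one bulk copy out.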