Just some quick feedback: a few hours ago I could log in again, but qstat gave an error message while crontab was running normally; now qstat works again.

At present only an IRC script is running under the Alebot account, and it can be considered a test routine; should I stop it to make the server update easier?


As you have surely noticed, the toolserver is even more unstable and unreliable
than normal at the moment. The reason is that our ha-nodes are no longer
working as intended and neither Nosy nor I am able to fix this.

A quick word on what ha-nodes are: The "ha" stands for "high availability" and
we have 2 servers for that. Some services at the toolserver are so important
that downtime is unacceptable (like /home, LDAP or the DNS), and for this reason
these services live on the ha-nodes. If one server goes down or crashes, the
other can continue to operate all services with little or no interruption
and without any work by a root. That worked great as long as River was
here and not so well in the last months, but now it is totally broken.
The problem is that both ha-nodes run Solaris and none of the roots is a Solaris
expert, which makes it hard for us to find errors, or in this case impossible.
We have set up a very ugly workaround, but it is not stable, so downtime of
important services causes downtime for the whole toolserver and more work for
the roots.

We can only think of one solution: replacing Solaris on the ha-nodes with
Linux. But this cannot start before Friday and it will take some time until
everything is moved over. It will also cause some hours of complete downtime
while /home is copied (we will announce this separately). In the best case
everything will be working again when Whitsun is over; in the worst case it will
take 2 weeks (I will be away between the 21st and the 26th for the general
meeting of WMDE).
Repairing the ha-nodes has top priority, so everything else will be
delayed (linux-update, database-reimports, account-creation (for VERY
important ones send me a mail), etc.).

