Hello all,
as you have surely noticed, the toolserver is even more unstable and unreliable than usual at the moment. The reason is that our ha-nodes are no longer working as intended, and neither Nosy nor I are able to fix this.
A quick word on what ha-nodes are: The "ha" stands for "highly available", and we have 2 servers for that. Some services at the toolserver are so important that downtime is unacceptable (like /home, LDAP or the DNS), and for this reason these services live on the ha-nodes. If one server goes down or crashes, the other can continue to operate all services with little or no interruption and without any work by a root. That worked great as long as River was here and not so well in the last months, but now it is totally broken. The problem is that both ha-nodes run Solaris and none of the roots is a Solaris expert, which makes it hard for us to find errors, or in this case impossible. We have set up a very ugly workaround, but it is not stable, so downtime of the important services causes downtime for the whole toolserver, and more work for the roots.
We can only think of one solution: replacing Solaris on the ha-nodes with Linux. But this cannot start before Friday, and it will take some time until everything is moved over. It will also cause some hours of complete downtime while /home is copied (we will announce this separately). In the best case everything will be working again when Whitsun is over; in the worst case it will take 2 weeks (I will be away between the 21st and the 26th for the general meeting of WMDE). Repairing the ha-nodes has top priority, so everything else will be delayed (linux-update, database-reimports, account-creation (for VERY important ones send me a mail), etc.).
If you have questions, please send them to the mailing list.
Sincerely, DaB.
toolserver-announce@lists.wikimedia.org