Hello all,
as you have surely noticed, the toolserver is even more unstable and unreliable than normal at the moment. The reason is that our ha-nodes are no longer working as intended and neither Nosy nor I are able to fix this.
A quick word on what ha-nodes are: the "ha" stands for "high availability" and we have 2 servers for that. Some services on the toolserver are so important that downtime is unacceptable (like /home, LDAP or the DNS), and for this reason these services live on the ha-nodes. If one server goes down or crashes, the other can continue to operate all services with little or no interruption and without any work by a root. That worked great as long as River was here and not so well in the last months, but now it is totally broken. The problem is that both ha-nodes run Solaris and none of the roots are Solaris experts, which makes it hard for us to find errors, or in this case impossible. We have set up a very ugly workaround, but it is not stable, so downtime of important services causes downtime for the whole toolserver, and more work for the roots.
We can only think of one solution: replacing Solaris on the ha-nodes with Linux. But this cannot start before Friday and it will take some time until everything is moved over. It will also cause some hours of complete downtime while /home is copied (we will announce this separately). In the best case everything will be working again when Whitsun is over; in the worst case it will take 2 weeks (I will be away between the 21st and the 26th for the general meeting of WMDE). The repairing of the ha-nodes has top priority, so everything else will be delayed (linux-update, database-reimports, account-creation (for VERY important ones send me a mail), etc.).
If you have questions, please send them to the ML.
Sincerely, DaB.
Just some quick feedback: some hours ago I could log in again, but qstat gave an error message while crontab was running normally; now qstat works again.
At present only an IRC script is running under the Alebot account, and it can be considered a test routine; do I have to stop it to make the server update easier?
Alex
2013/5/13 DaB. WP@daniel.baur4.info
-- Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
On Mon, May 13, 2013, at 05:01 PM, DaB. wrote:
The repairing of the ha-nodes has top priority, so everything else will be delayed (linux-update, database-reimports, account-creation (for VERY important ones send me a mail), etc.).
If you have questions, please send them to the ML.
Is the current outage of replication on sql-s1-user (now approaching 48 hours) related to this ha-node problem? At least some other dbs seem to still have replication working.
Linux is your best bet. Also, errors 404 & 401 are non-responsive. I can connect to all servers, but on 2 of them msg/nickserver/password is the 401 & 404 error stub. See if this information helps you; if not, write me back.
Best Regards
[MILASTARX]:[TS]
On May 13, 2013 6:02 PM, "DaB." WP@daniel.baur4.info wrote:
On 05/13/2013 05:01 PM, DaB. wrote:
The problem is that both ha-nodes run Solaris and none of the roots are Solaris experts, which makes it hard for us to find errors, or in this case impossible.
There is a former colleague of mine with whom I've kept contact who is a serious high-grade guru with Solaris. Would you like me to put him in contact with you guys? Maybe he can give a hand or lend expertise?
-- Marc
Sure
On May 14, 2013 2:53 PM, "Marc A. Pelletier" marc@uberbox.org wrote:
Hello all, At Saturday 18 May 2013 22:29:07 DaB. wrote:
We can only think of one solution: replacing Solaris on the ha-nodes with Linux. But this cannot start before Friday and it will take some time until everything is moved over.
Today I started to move some services to the Linux version of the ha-cluster. So far nagios (without web interface), the sql-tunnel to the WMF and the ts-irc-bot have moved over. The next big things are SGE and LDAP, which will move tomorrow (Sunday). For this I announce a total downtime of SGE for
TOMORROW, between 18:00 and 22:00 UTC.
SGE will not be down the whole time, but expect that it can go down at any time during that timeframe. The LDAP move will happen at an unknown time tomorrow, but the downtime should not be more than a few minutes.
Sincerely, DaB.
Hello all, At Sunday 19 May 2013 23:46:02 DaB. wrote:
SGE will not be down the whole time, but expect that it can go down at any time during that timeframe. The LDAP move will happen at an unknown time tomorrow, but the downtime should not be more than a few minutes.
the SGE move worked more or less without problems and everything seems to work AFAIS. During the move we noticed that the Solaris version of qcronsub was broken, and that was fixed on the fly too. The LDAP move is not complete yet and Nosy will continue with it tomorrow. So if you notice an LDAP or a (file) permission problem tomorrow, that is nothing to worry about.
Good night, DaB.
Is SVN supposed to be down still?
DrTrigon
Please let me know if there is any assistance I can provide.
Regards,
AllyUnion
On Mon, May 20, 2013 at 1:37 AM, Dr. Trigon dr.trigon@surfeu.ch wrote:
Is SVN supposed to be down still?
DrTrigon
Please make SVN work... :)
On 20.05.2013 12:39, Jason Y. Lee wrote:
Please let me know if there is any assistance I can provide.
Regards,
AllyUnion
Hello all, At Tuesday 21 May 2013 00:42:31 DaB. wrote:
Is SVN supposed to be down still?
yes, there is a minor problem with nginx that I haven't had time to fix yet. Also, there is a harmless error message about quota at login. I will try to fix the SVN problem tomorrow.
Sincerely, DaB.
yes, there is a minor problem with nginx that I haven't had time to fix yet. Also, there is a harmless error message about quota at login.
The funny thing about the quota error message is that it works correctly if I am over quota when logging in. Not so if the quota is not exceeded... ;))
Greetings
(anonymous) wrote:
yes, there is a minor problem with nginx that I haven't had time to fix yet. Also, there is a harmless error message about quota at login.
The funny thing about the quota error message is that it works correctly if I am over quota when logging in. Not so if the quota is not exceeded... ;))
The more irritating thing is that it works on willow, yet the mounts look identical:
| [tim@passepartout ~]$ for HOST in willow yarrow; do ssh $HOST.toolserver.org mount | grep -w /sge; done
| /sge on ha-sge.esi:/global/misc/sge remote/read/write/setuid/devices/rstchown/vers=3/proto=udp/xattr/dev=4b00006 on Sun May 19 20:45:25 2013
| ha-sge.esi:/global/misc/sge on /sge type nfs (rw,proto=udp,vers=3,addr=10.24.1.16)
| [tim@passepartout ~]$
Either Solaris's quota is silent about not being able to access some file systems, or ha-sge.esi seems to be blocked from yarrow, but not from willow ("host ha-nfs.esi" yields the same on both).
Tim
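The two `mount` lines above use different layouts (Solaris prints the mount point first, Linux prints the source first), which makes them awkward to compare by eye. The following is only a sketch of how the comparison could be scripted; `nfs_source` is a hypothetical helper name, not an existing toolserver tool, and it assumes the two layouts look exactly like the output quoted above.

```shell
#!/bin/sh
# nfs_source (hypothetical helper): read `mount` output on stdin and print
# the source ("host:/path") that backs the mount point given as $1.
# Assumed layouts, taken from the output quoted in the thread:
#   Solaris: <mountpoint> on <source> <options> on <date>
#   Linux:   <source> on <mountpoint> type <fstype> (<options>)
nfs_source() {
    awk -v mp="$1" '
        $2 == "on" && $1 == mp { print $3; exit }   # Solaris layout
        $2 == "on" && $3 == mp { print $1; exit }   # Linux layout
    '
}

# Demo on the two lines from the thread; both print the same source.
printf '%s\n' '/sge on ha-sge.esi:/global/misc/sge remote/read/write/setuid/devices/rstchown/vers=3/proto=udp/xattr/dev=4b00006 on Sun May 19 20:45:25 2013' | nfs_source /sge
printf '%s\n' 'ha-sge.esi:/global/misc/sge on /sge type nfs (rw,proto=udp,vers=3,addr=10.24.1.16)' | nfs_source /sge
```

Against the live hosts this would be used as `ssh willow.toolserver.org mount | nfs_source /sge`, and likewise for yarrow. If both hosts report the same source, the difference in quota behaviour is more likely to come from the quota lookup itself than from the mount.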