Status of the toolserver


DaB.

13 May 2013

5:01 p.m.

Hello all,

as you have surely noticed, the toolserver is even more unstable and unreliable than normal at the moment. The reason is that our ha-nodes are no longer working as intended, and neither Nosy nor I are able to fix this.

A quick word on what ha-nodes are: the "ha" stands for "high availability" and we have 2 servers for that. Some services at the toolserver are so important that downtime is unacceptable (like /home, LDAP or the DNS), and for this reason these services live on the ha-nodes. If one server goes down or crashes, the other can continue to operate all services with little or no interruption and without any work by a root. That worked great as long as River was here and not so well in the last months, but now it is totally broken. The problem is that both ha-nodes run Solaris and none of the roots is a Solaris expert, which makes it hard, or in this case impossible, for us to find errors. We have set up a very ugly workaround, but it is not stable, so downtime of important services causes downtime for the whole toolserver – and more work for the roots.

We can only think of one solution: replacing Solaris on the ha-nodes with Linux. But this cannot start before Friday, and it will take some time until everything is moved over. It will also cause some hours of complete downtime while /home is copied (we will announce this separately). In the best case everything will be working again when Whitsun is over; in the worst case it will take 2 weeks (I will be away between the 21st and the 26th for the general meeting of WMDE). Repairing the ha-nodes has top priority, so everything else will be delayed (linux-update, database-reimports, account-creation (for VERY important ones send me a mail), etc.).

If you have questions, please send them to the ML.

Sincerely, DaB.

-- Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885


Alex Brollo

14 May

2:04 a.m.

Just some quick feedback: some hours ago I could log in again, but qstat gave an error message while crontab was running regularly; now qstat runs again.

At present only an IRC script is running under the Alebot account, and it can be considered a test routine; should I stop it to make the server update easier?

Alex


Russell Blau

2:46 p.m.

On Mon, May 13, 2013, at 05:01 PM, DaB. wrote:

...

The repairing of the ha-nodes has top priority, so everything else will be delayed (linux-update, database-reimports, account-creation (for VERY important ones send me a mail), etc.).

If you have questions, please send them to the ML.

Is the current outage of replication on sql-s1-user (now approaching 48 hours) related to this ha-node problem? At least some other dbs seem to still have replication working.

-- Russell Blau russblau@imapmail.org

Patricia Pintilie

3:43 p.m.

Linux is your best bet. Also, errors 404 & 401 are unresponsive. I can connect to all servers, but on 2 of them msg/nickserver/password gives the 401 & 404 error stub. See if this information helps you; if not, write me back.

Best regards,
[MILASTARX]:[TS]

Marc A. Pelletier

3:53 p.m.

On 05/13/2013 05:01 PM, DaB. wrote:

...

The problem is that both ha-nodes run Solaris and all roots are no Solaris-experts what makes it hard for us to find errors or in this case impossible.

There is a former colleague of mine with whom I've kept contact who is a serious high-grade guru with Solaris. Would you like me to put him in contact with you guys? Maybe he can give a hand or lend expertise?

-- Marc

Patricia Pintilie

15 May

3:21 a.m.

Sure

DaB.

18 May

4:37 p.m.

Hello all,

At Saturday 18 May 2013 22:29:07 DaB. wrote:

...

We can only think of one solution: replacing Solaris on the ha-nodes with Linux. But this cannot start before Friday, and it will take some time until everything is moved over.

Today I started moving some services to the Linux version of the ha-cluster. So far nagios (without web-interface), the sql-tunnel to the WMF and the ts-irc-bot have moved over. The next big things are SGE and LDAP, which will move tomorrow (Sunday). For this I announce a total downtime of SGE for

TOMORROW, between 18:00 and 22:00 UTC.

SGE will not be down the whole time, but you should expect that it can be down at any time during that timeframe. The LDAP move will happen at an unknown time tomorrow, but the downtime should not be more than a few minutes.

Sincerely, DaB.

-- Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885

DaB.

19 May

5:50 p.m.

Hello all,

At Sunday 19 May 2013 23:46:02 DaB. wrote:

...

SGE will not be down the whole time, but you should expect that it can be down at any time during that timeframe. The LDAP move will happen at an unknown time tomorrow, but the downtime should not be more than a few minutes.

The SGE move worked more or less without a problem and everything seems to work AFAIS. During the move it was noticed that the Solaris version of qcronsub was broken, and that was fixed on the fly too. The LDAP move is not complete yet and Nosy will continue with it tomorrow. So if you notice an LDAP or a file-permission problem tomorrow, that is nothing to worry about.

Good night, DaB.

-- Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885

Dr. Trigon

20 May

4:37 a.m.

Is SVN supposed to be down still?

DrTrigon

Jason Y. Lee

6:39 a.m.

Please let me know if there is any assistance I can provide.

Regards,

AllyUnion


Dr. Trigon

11:41 a.m.

Please make SVN work... :)

DaB.

6:44 p.m.

Hello all,

At Tuesday 21 May 2013 00:42:31 DaB. wrote:

...

Is SVN supposed to be down still?

Yes, there is a minor problem with nginx that I haven't had time to fix yet. Also, there is a harmless error message about quota at login. I will try to fix the SVN problem tomorrow.

Sincerely, DaB.

-- Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885

Patricia Pintilie

23 May

7:31 a.m.

General chat • Re: Navigating Multiboot GRUB2 menu entries successfully.
http://forum.porteus.org/viewtopic.php?t=2195&p=15042#p15042

Dr. Trigon

25 May

5:47 a.m.


...

yes, there is a minor problem with nginx I haven't time to fix yet. Also there is a harmless error-message about quota at login.

The funny thing about the quota error message is that it works correctly if I am over quota when logging in. Not so if the quota is not exceeded... ;))

Greetings


Tim Landscheidt

11:55 a.m.

(anonymous) wrote:

...

yes, there is a minor problem with nginx I haven't time to fix yet. Also there is a harmless error-message about quota at login.

The funny thing with the quota error-message is, it works correct if I do have over-quota when loggin-in. Not so if the quota is not exceeded... ;))

The more irritating thing is that it works on willow, yet the mounts look identical:

| [tim@passepartout ~]$ for HOST in willow yarrow; do ssh $HOST.toolserver.org mount | grep -w /sge; done
| /sge on ha-sge.esi:/global/misc/sge remote/read/write/setuid/devices/rstchown/vers=3/proto=udp/xattr/dev=4b00006 on Sun May 19 20:45:25 2013
| ha-sge.esi:/global/misc/sge on /sge type nfs (rw,proto=udp,vers=3,addr=10.24.1.16)
| [tim@passepartout ~]$

Either Solaris's quota is silent about not being able to access some file systems, or ha-sge.esi seems to be blocked from yarrow, but not from willow ("host ha-nfs.esi" yields the same on both).

Tim
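The two mount lines quoted above use different layouts (mount point first with slash-separated options in one, source first with comma-separated options in the other), so "look identical" is easier to check after normalising them. What follows is only a rough sketch in Python 3: the function name parse_mount_line and both regular expressions are made up for illustration and are written against the two lines in this thread, not against every form of mount output. It can confirm that the source, mount point, protocol and NFS version match; it says nothing about whether the NFS server is actually reachable from either host.

```python
import re

# Layout of the first quoted line (Solaris-style, as assumed here):
#   <mountpoint> on <source> <opt>/<opt>/... on <date>
SOLARIS_STYLE = re.compile(
    r"^(?P<mountpoint>\S+) on (?P<source>\S+) (?P<opts>\S+) on .+$"
)
# Layout of the second quoted line (Linux-style, as assumed here):
#   <source> on <mountpoint> type <fstype> (<opt>,<opt>,...)
LINUX_STYLE = re.compile(
    r"^(?P<source>\S+) on (?P<mountpoint>\S+) type \S+ \((?P<opts>[^)]*)\)$"
)


def parse_mount_line(line):
    """Reduce one line of mount output to the fields worth comparing."""
    line = line.strip()
    match, separator = LINUX_STYLE.match(line), ","
    if match is None:
        match, separator = SOLARIS_STYLE.match(line), "/"
    if match is None:
        raise ValueError("unrecognised mount line: " + line)
    options = {}
    for option in match.group("opts").split(separator):
        key, _, value = option.partition("=")
        options[key] = value
    return {
        "source": match.group("source"),
        "mountpoint": match.group("mountpoint"),
        "proto": options.get("proto"),
        "vers": options.get("vers"),
    }


# The two lines exactly as they appear in the message above.
solaris_style = (
    "/sge on ha-sge.esi:/global/misc/sge "
    "remote/read/write/setuid/devices/rstchown/vers=3/proto=udp/xattr/dev=4b00006 "
    "on Sun May 19 20:45:25 2013"
)
linux_style = (
    "ha-sge.esi:/global/misc/sge on /sge type nfs "
    "(rw,proto=udp,vers=3,addr=10.24.1.16)"
)

same = parse_mount_line(solaris_style) == parse_mount_line(linux_style)
print("mounts equivalent:", same)
```

Host-specific options such as dev= and addr= are deliberately left out of the comparison, since they are expected to differ between machines.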
