For the last couple of days, I've been having problems with interactive ssh into login.tools.wmflabs.org. Every so often (multiple times an hour, at least), my connection will hang for a few seconds, sometimes more like 10-15 seconds. I connect from my home macOS box on broadband using:
ssh -t -i ~/.ssh/id_rsa_wikimedia roysmith@login.tools.wmflabs.org tmux attach -t work
The load doesn't look unreasonable:
$ uptime
 18:26:34 up 35 days, 9:04, 39 users, load average: 0.74, 1.93, 1.80
and ping times look fine:
$ ping -v login.tools.wmflabs.org
PING login.tools.wmflabs.org (185.15.56.48): 56 data bytes
64 bytes from 185.15.56.48: icmp_seq=0 ttl=51 time=24.233 ms
64 bytes from 185.15.56.48: icmp_seq=1 ttl=51 time=27.086 ms
64 bytes from 185.15.56.48: icmp_seq=2 ttl=51 time=22.121 ms
64 bytes from 185.15.56.48: icmp_seq=3 ttl=51 time=22.726 ms
64 bytes from 185.15.56.48: icmp_seq=4 ttl=51 time=24.497 ms
64 bytes from 185.15.56.48: icmp_seq=5 ttl=51 time=24.809 ms
64 bytes from 185.15.56.48: icmp_seq=6 ttl=51 time=23.913 ms
64 bytes from 185.15.56.48: icmp_seq=7 ttl=51 time=25.811 ms
64 bytes from 185.15.56.48: icmp_seq=8 ttl=51 time=25.266 ms
64 bytes from 185.15.56.48: icmp_seq=9 ttl=51 time=22.865 ms
64 bytes from 185.15.56.48: icmp_seq=10 ttl=51 time=32.076 ms
64 bytes from 185.15.56.48: icmp_seq=11 ttl=51 time=26.069 ms
64 bytes from 185.15.56.48: icmp_seq=12 ttl=51 time=27.947 ms
64 bytes from 185.15.56.48: icmp_seq=13 ttl=51 time=27.088 ms
^C
--- login.tools.wmflabs.org ping statistics ---
14 packets transmitted, 14 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 22.121/25.465/32.076/2.484 ms
I'm in New York City, and login.tools.wmflabs.org looks like it's in Virginia, so that's pretty close.
This seems to have started in the past few days. Anybody else seeing problems?
On Wed, Nov 20, 2019 at 6:35 PM Roy Smith roy@panix.com wrote:
The last couple of days, I've been having problems with interactive ssh into login.tools.wmflabs.org. Every so often (multiple times an hour, at least), my connection will hang for a few seconds. Sometimes more like 10-15 seconds. I connect from my home MacOS box on broadband using:
Often the cause of this sort of behavior is that the allowed bandwidth between the bastion server and the NFS server, which provides $HOME directories for both users and tool accounts, is being saturated by some activity. Doing something seemingly trivial like `cd $HOME; ls` calls out to the NFS server for the `ls` data, and this can end up queuing for space in the connection to that server.
NFS overload can happen for many reasons, but is more likely when one or more people are running large scp/sftp downloads from the bastion to their local computer or running bots or other programs which generate a lot of disk activity from the bastion directly rather than launching the process on the job grid or kubernetes cluster.
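One rough way to check whether a hang lines up with slow NFS responses is to time a directory listing repeatedly and log the slow samples. The following is only a sketch: the function names `ls_latency_s` and `watch_ls_latency` are made up for illustration, and timing is in whole seconds, which is coarse but enough to catch multi-second stalls like the ones described above.

```shell
# Sketch only: helper names are hypothetical, not Toolforge tooling.

# Print how many whole seconds an `ls` of the given directory takes.
ls_latency_s() {
    start=$(date +%s)
    ls "$1" > /dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Sample the listing latency `count` times, `interval` seconds apart,
# and print a timestamped line for each sample that took at least
# `threshold` seconds.
watch_ls_latency() {
    dir=$1 count=$2 interval=$3 threshold=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        elapsed=$(ls_latency_s "$dir")
        if [ "$elapsed" -ge "$threshold" ]; then
            echo "$(date '+%H:%M:%S') ls $dir took ${elapsed}s"
        fi
        i=$(( i + 1 ))
        sleep "$interval"
    done
}
```

For example, `watch_ls_latency "$HOME" 360 10 2` would sample once every 10 seconds for about an hour and report any listing that took 2 seconds or more. A slow `ls` is only a hint that NFS is involved, not proof.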
At the moment I am writing this, `pstree -clapu` on login.tools.wmflabs.org shows me:
* tools.lziad running a nodejs process with many active threads
* tools.exambot running an irc bot (sopel)
* mzmcbride running a script named touch.py
* tools.editgroups running a script named lag_watcher.sh
* bugreporter running a GNU Screen session with multiple python2 processes open
* tools.rebot running a pywikibot script
* jarbot-ii running an sftp server
* iluvatar running an sftp server
* tools.wikiportretdev running an sftp server
* tools.largedatasetbot running an sftp server
* jjmc89 running an sftp server
* magnus running an sftp server
* tools.mbrt1 running an unrealircd server (?!)
The sftp servers are expected. We currently do not have any other means for people to upload/download files to and from Toolforge. The other processes all appear at least on the surface to be things that would be better suited to running on either the job grid [0] or the Kubernetes cluster [1].
[0]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid
[1]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
Bryan
Since we are talking about this, I've been meaning to ask. I run a build on bastion from time to time when I need to deploy a new version (npm run build). Is there a better way of doing this? Should I try to run the build on the grid?.. It's not a periodical task, I just run it manually maybe once a week and it lasts for 2-3 minutes. Please advise.
Le Loy
On Fri, Nov 22, 2019 at 11:56 PM Ле Лой kf8.wikipedia@gmail.com wrote:
Since we are talking about this, I've been meaning to ask. I run a build on bastion from time to time when I need to deploy a new version (npm run build). Is there a better way of doing this? Should I try to run the build on the grid?.. It's not a periodical task, I just run it manually maybe once a week and it lasts for 2-3 minutes. Please advise.
If your tool is running on Kubernetes normally (`webservice --backend=kubernetes ...` or via a custom deployment), then using `webservice --backend=kubernetes [type] shell` to get an interactive shell that is actually using compute resources from the Kubernetes cluster is better than running on the bastion directly.
If your tool is running on the job grid normally and can run the build as a grid job, please do.
To run an I/O-intensive process that needs interaction, you can use dev.tools.wmflabs.org [0] as your bastion. This is functionally the same experience as running commands directly on login.tools.wmflabs.org, but this server is intended for use by folks who are doing heavier interactive work, and it will not interfere with the larger number of folks who are using the "main" bastion.
[0]: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/About_Toolforge#Bastion...
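Switching bastions is mostly a matter of changing the host name in the ssh command. As a sketch, the helper below prints an ssh command modeled on the one from the first message in this thread, defaulting to the dev bastion. The function name `bastion_tmux_cmd` and the `TOOLFORGE_USER` variable are hypothetical, the key path is simply the one quoted earlier, and `tmux new-session -A -s work` is used so that the session is created if it does not exist yet on that host.

```shell
# Sketch only: function and variable names are made up for illustration.

# Print an ssh command that opens (or re-attaches to) a tmux session
# named "work" on a Toolforge bastion. The host defaults to the bastion
# meant for heavier interactive work; pass another host to override it.
bastion_tmux_cmd() {
    host=${1:-dev.tools.wmflabs.org}
    # Assumption: the shell account name is in TOOLFORGE_USER, else $USER.
    user=${TOOLFORGE_USER:-$USER}
    echo "ssh -t -i $HOME/.ssh/id_rsa_wikimedia $user@$host tmux new-session -A -s work"
}
```

Running `bastion_tmux_cmd` prints the command for dev.tools.wmflabs.org, and `bastion_tmux_cmd login.tools.wmflabs.org` prints the equivalent for the main bastion, so the command can be checked before it is pasted into a terminal.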
Bryan
If your tool is running on the job grid normally and can run the build as a grid job, please do.
Thank you! I tried running the build on the grid and everything works except for the `webservice` command that seems to be unavailable. How do I restart the web service after my build successfully deployed the latest version?
Le Loy
I have been experiencing a lot of problems on that bastion too. Apart from intermittent hangs and ssh timeouts, some files are getting randomly corrupted while I transfer them over SFTP; every time I copy them, they have a different checksum.
It was so problematic that I had to pack them all into a .zip file on my computer and try copying it several times until the checksum matched. That has never happened before.
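For anyone who wants to check a transfer the same way, here is a minimal sketch of comparing SHA-256 digests. The helper names `sha256_of` and `verify_sha256` are made up for illustration, and the choice of SHA-256 is an assumption, since the message above does not say which checksum was used. It relies on `sha256sum` (typical on Linux) or `shasum` (shipped with macOS) being installed.

```shell
# Sketch only: helper names are hypothetical.

# Print the SHA-256 digest of a file, using whichever tool is available.
sha256_of() {
    if command -v sha256sum > /dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1
    else
        shasum -a 256 "$1" | cut -d ' ' -f 1
    fi
}

# Succeed only if the file's digest equals the expected digest.
verify_sha256() {
    [ "$(sha256_of "$1")" = "$2" ]
}
```

One way to use it is to compute the digest on the bastion (for example with `sha256sum files.zip` in an ssh session, where `files.zip` is a placeholder name), then run `verify_sha256 files.zip <digest>` on the local copy and retry the transfer if it fails.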
On Nov 28, 2019 21:26, Ле Лой kf8.wikipedia@gmail.com wrote:
If your tool is running on the job grid normally and can run the build as a grid job, please do.
Thank you! I tried running the build on the grid and everything works except for the `webservice` command that seems to be unavailable. How do I restart the web service after my build successfully deployed the latest version?
Le Loy
_______________________________________________ Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud