On Wed, Nov 20, 2019 at 6:35 PM Roy Smith roy@panix.com wrote:
The last couple of days, I've been having problems with interactive ssh into login.tools.wmflabs.org. Every so often (multiple times an hour, at least), my connection will hang for a few seconds. Sometimes more like 10-15 seconds. I connect from my home MacOS box on broadband using:
Often the cause of this sort of behavior is saturation of the allowed bandwidth between the bastion server and the NFS server that provides $HOME directories for both users and tool accounts. Even something seemingly trivial like `cd $HOME; ls` calls out to the NFS server for the `ls` data, and that request can end up queuing for space in the connection to that server.
NFS overload can happen for many reasons, but it is more likely when one or more people are running large scp/sftp downloads from the bastion to their local computer, or running bots or other programs that generate a lot of disk activity directly on the bastion rather than launching them on the job grid or Kubernetes cluster.
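If you want a rough check of whether NFS is the slow part during one of these hangs, something like the following could help. This is only a sketch: `ls_seconds` is a made-up helper name, and treating /tmp as local (non-NFS) disk on the bastion is an assumption on my part. It times a directory listing in whole seconds, which is coarse but enough to notice a 10-15 second stall.

```shell
# Hypothetical helper: print the whole seconds taken to list a directory.
# Assumes a POSIX shell and a date(1) that supports +%s (GNU and BSD date do).
ls_seconds() {
  start=$(date +%s)
  ls "$1" > /dev/null 2>&1
  end=$(date +%s)
  echo $((end - start))
}

# $HOME is served from NFS; /tmp being on local disk is an assumption.
echo "HOME: $(ls_seconds "$HOME")s"
echo "/tmp: $(ls_seconds /tmp)s"
```

If the $HOME listing stalls while the /tmp listing returns immediately, that points toward the NFS connection rather than your ssh session or home network.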
At the moment I am writing this, `pstree -clapu` on login.tools.wmflabs.org shows me:
* tools.lziad running a nodejs process with many active threads
* tools.exambot running an irc bot (sopel)
* mzmcbride running a script named touch.py
* tools.editgroups running a script named lag_watcher.sh
* bugreporter running a GNU Screen session with multiple python2 processes open
* tools.rebot running a pywikibot script
* jarbot-ii running an sftp server
* iluvatar running an sftp server
* tools.wikiportretdev running an sftp server
* tools.largedatasetbot running an sftp server
* jjmc89 running an sftp server
* magnus running an sftp server
* tools.mbrt1 running an unrealircd server (?!)
The sftp servers are expected. We currently do not have any other means for people to upload/download files to and from Toolforge. The other processes all appear at least on the surface to be things that would be better suited to running on either the job grid [0] or the Kubernetes cluster [1].
[0]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid
[1]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
Bryan