[Cloud] Getting a lot of 502, 503 server errors on toolforge ???

4 Jan 2021


My Toolforge service (https://author-disambiguator.toolforge.org/) keeps
becoming unavailable, with hangs, 502 Bad Gateway, or other server errors, a
few minutes after I restart it, and I can't see what could be causing this.
These errors don't show up in the error log, and the 502 responses don't
show up in the access log (which has had very little traffic anyway, usually
one request per minute at most). I can connect to the Kubernetes pod
with kubectl and everything looks normal: there are only a few processes
listed in /proc, etc. (though it would be nice to have some other
monitoring tools like ps and netstat installed by default). But I can't get
a response via the web after the first few minutes.
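
For anyone else stuck inside a pod without ps, here is a minimal sketch of
listing processes straight from /proc. It assumes a Python 3 interpreter is
available in the container (which may not be the case for every Toolforge
image); the function name list_processes is just an illustrative choice.

```python
import os


def list_processes(proc_root="/proc"):
    """Return a sorted list of (pid, command line) read from proc_root."""
    procs = []
    for entry in os.listdir(proc_root):
        # Only numeric directory names correspond to processes.
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                raw = handle.read()
        except OSError:
            # The process may have exited between listdir() and open().
            continue
        # Arguments in cmdline are separated by NUL bytes.
        cmd = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        procs.append((int(entry), cmd))
    return sorted(procs)


if __name__ == "__main__":
    for pid, cmd in list_processes():
        print(pid, cmd or "[no command line]")
```

Kernel threads and zombie processes have an empty cmdline, so they show up
with the placeholder text when the script is run directly.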
The problem seems to have started mid-day yesterday (January 3); see the
monitor data here:
https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kuber...
which shows a surge in 4xx and 5xx status codes on that date. (By the way, I
don't see the surge in 4xx status codes in access.log either: there are 2
from this morning and none yesterday, nothing like the multiple per second
indicated in that Grafana chart!)
Any ideas what's going on? This looks like some sort of upstream issue with
nginx maybe?
I am seeing a "You have run out of local ports" error in the error logs
from earlier today (but it hasn't repeated recently) which is maybe a clue?
I don't think that could possibly be from anything my service is doing
though!
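
One way to check whether the pod itself is holding many sockets, without
netstat, is to count TCP connection states from /proc/net/tcp. This is only a
sketch: it assumes Python 3 is available in the container, and it only sees
the pod's own network namespace, so it would not reveal port exhaustion on an
upstream proxy. The function names are illustrative.

```python
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # Field 3 is the hex state code; unknown codes are kept as-is.
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def read_tcp_states(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum state counts over the IPv4 and IPv6 socket tables."""
    total = Counter()
    for path in paths:
        try:
            with open(path) as handle:
                total += count_tcp_states(handle.read())
        except OSError:
            pass  # the table may be missing, e.g. IPv6 disabled
    return total


if __name__ == "__main__":
    for state, n in read_tcp_states().most_common():
        print(state, n)
```

A very large TIME_WAIT or ESTABLISHED count inside the pod would suggest the
service is opening the connections itself; a small count would point
elsewhere.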
Help would be greatly appreciated, thanks!
Arthur Smith
