Re: [Cloud] Getting a lot of 502, 503 server errors on toolforge ???

5 Jan 2021


I mean in Phabricator - https://phabricator.wikimedia.org/T271181
Arthur
On Mon, Jan 4, 2021 at 8:14 PM Arthur Smith arthurpsmith@gmail.com wrote:
...
Ok, see T271181 in Toolforge.
Arthur
On Mon, Jan 4, 2021 at 6:59 PM Arthur Smith arthurpsmith@gmail.com wrote:
...
I've restarted it 3 times already!
On Mon, Jan 4, 2021 at 5:41 PM Brooke Storm bstorm@wikimedia.org wrote:
...
Hello Arthur,
I suspect this could be related to a serious problem with LDAP TLS that
happened yesterday around the time I’m seeing in the graph. Some
information is on this ticket (https://phabricator.wikimedia.org/T271063).
That broke Gerrit authentication and lots of other things that are Cloud
Services and Toolforge related until it was resolved. That said, it sounds
like something else may also be going on that we can look into. If you
haven’t already, restarting the web service might not be a bad idea.
If it doesn’t clear up with a restart, please make a Phabricator task to
help coordinate.
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm@wikimedia.org
On Jan 4, 2021, at 3:27 PM, Arthur Smith arthurpsmith@gmail.com wrote:
My toolforge service (https://author-disambiguator.toolforge.org/)
keeps becoming unavailable with hangs/502 Bad Gateway or other server
errors a few minutes after I restart it, and I can't see what could be
causing this. These errors don't show up in the error log, and the 502
responses don't show up in the access log (which has had very little
traffic anyway - one request per minute at most usually?) I can connect to
the Kubernetes pod with kubectl and everything looks normal; there are only a
few processes listed in /proc, etc. (though it would be nice to have some
other monitoring tools like ps and netstat installed by default?) But I
can't get a response via the web after the first few minutes.
The problem seems to have started mid-day yesterday - see the monitor
data here:
https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kuber...
with the surge in 4xx and 5xx status codes on 1/3 (by the way, I don't
see the surge in 4xx status codes in access.log recently either - there are
2 from this morning and none yesterday, nothing like the multiple per
second indicated in that grafana chart!)
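One way to put numbers on the comparison between access.log and the Grafana chart is to tally responses by status class. The following is only a minimal sketch: it assumes the access log uses a common/combined-style format in which the quoted request is followed by a three-digit status code, and the function name count_status_classes is made up for illustration.

```python
import re
from collections import Counter

# Assumed log shape (common/combined style), e.g.:
#   1.2.3.4 - - [04/Jan/2021:15:00:00 +0000] "GET / HTTP/1.1" 502 0 "-" "agent"
# The pattern picks up the 3-digit status that follows the quoted request.
STATUS_RE = re.compile(r'"[^"]*" (\d{3}) ')


def count_status_classes(lines):
    """Return a Counter keyed by status class ('2xx', '4xx', '5xx', ...).

    Lines that do not match the assumed format are skipped.
    """
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts
```

Passing an open file works because the function only iterates over lines, for example count_status_classes(open("access.log")). If the totals stay far below what the chart shows, that would be consistent with the errors being generated in front of the tool's own web server, though it would not prove it.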
Any ideas what's going on? This looks like some sort of upstream issue
with nginx maybe?
I am seeing a "You have run out of local ports" error in the error logs
from earlier today (but it hasn't repeated recently) which is maybe a clue?
I don't think that could possibly be from anything my service is doing
though!
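Without netstat in the container, socket usage inside the pod can still be read from /proc/net/tcp (and /proc/net/tcp6), which Linux exposes as text with the connection state as a hex code in the fourth column. The sketch below counts sockets per state; the function names are hypothetical, and it only sees the pod's own network namespace, so it cannot say anything about port usage on a proxy in front of the pod.

```python
from collections import Counter

# TCP state codes as used in /proc/net/tcp (hex, fourth column).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    # The first line is the column header; each later line is one socket.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        # Unknown codes are kept as-is rather than dropped.
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def read_tcp_states(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum the per-state counts over the IPv4 and IPv6 tables, if readable."""
    total = Counter()
    for path in paths:
        try:
            with open(path) as table:
                total += count_tcp_states(table.read())
        except OSError:
            continue  # file missing or unreadable in this environment
    return total
```

Calling read_tcp_states() from a Python shell inside the pod gives one count per state. A very large number of TIME_WAIT or ESTABLISHED entries would suggest the service itself is holding many local ports; a small number would point elsewhere.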
Help would be greatly appreciated, thanks!
Arthur Smith
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud

