Arthur
On Mon, Jan 4, 2021 at 8:14 PM Arthur Smith <arthurpsmith(a)gmail.com> wrote:
Ok, see T271181 in Toolforge.
Arthur
On Mon, Jan 4, 2021 at 6:59 PM Arthur Smith <arthurpsmith(a)gmail.com>
wrote:
> I've restarted it 3 times already!
>
> On Mon, Jan 4, 2021 at 5:41 PM Brooke Storm <bstorm(a)wikimedia.org> wrote:
>
>> Hello Arthur,
>> I suspect this could be related to a serious problem with LDAP TLS that
>> happened yesterday around the time I’m seeing in the graph. Some
>> information is on this ticket (https://phabricator.wikimedia.org/T271063).
>> That broke Gerrit authentication and lots of other things that are Cloud
>> Services and Toolforge related until it was resolved. That said, it sounds
>> like there is also something else going on perhaps that we can take a look
>> into. If you haven’t already, restarting the web service might not be a bad
>> idea.
>>
>> If it doesn’t clear up with a restart, please make a Phabricator task to
>> help coordinate.
>>
>> Brooke Storm
>> Staff SRE
>> Wikimedia Cloud Services
>> bstorm(a)wikimedia.org
>>
>>
>>
>> On Jan 4, 2021, at 3:27 PM, Arthur Smith <arthurpsmith(a)gmail.com> wrote:
>>
>> My toolforge service (https://author-disambiguator.toolforge.org/)
>> keeps becoming unavailable with hangs/502 Bad Gateway or other server
>> errors a few minutes after I restart it, and I can't see what could be
>> causing this. These errors don't show up in the error log, and the 502
>> responses don't show up in the access log (which has had very little
>> traffic anyway - one request per minute at most usually?) I can connect to
>> the kubernetes pod with kubectl and everything looks normal; there are only a
>> few processes listed in /proc, etc. (though it would be nice to have some
>> other monitoring tools like ps and netstat installed by default?) But I
>> can't get a response via the web after the first few minutes.
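Since ps and netstat aren't in the container image, the same information can be read straight out of /proc. A minimal sketch (run inside the pod, e.g. via `kubectl exec`; it assumes nothing beyond a standard Linux /proc filesystem):

```shell
# Sketch: list running processes without ps by walking /proc directly.
# Assumes only a standard Linux /proc filesystem inside the container.
for pid in /proc/[0-9]*; do
  # /proc/<pid>/cmdline is NUL-separated; turn NULs into spaces for display
  printf '%s\t%s\n' "${pid#/proc/}" "$(tr '\0' ' ' < "$pid/cmdline" 2>/dev/null)"
done
```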
>>
>> The problem seems to have started mid-day yesterday - see the monitor
>> data here:
>>
>>
>> https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kube…
>>
>> with the surge in 4xx and 5xx status codes on 1/3 (by the way, I don't
>> see the surge in 4xx status codes in access.log recently either - there are
>> 2 from this morning and none yesterday, nothing like the multiple per
>> second indicated in that grafana chart!)
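For cross-checking the access log against the Grafana chart, a quick per-status-code tally can help. A sketch; the `access.log` path and the status code being the 9th whitespace-separated field (nginx combined format) are assumptions about this tool's log:

```shell
# Sketch: count requests per HTTP status code in an nginx combined-format
# access log, where the status code is the 9th whitespace-separated field.
# "access.log" is an assumed path for the tool's web service log.
awk '{counts[$9]++} END {for (code in counts) print counts[code], code}' access.log | sort -rn
```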
>>
>> Any ideas what's going on? This looks like some sort of upstream issue
>> with nginx maybe?
>>
>> I am seeing a "You have run out of local ports" error in the error logs
>> from earlier today (but it hasn't repeated recently) which is maybe a clue?
>> I don't think that could possibly be from anything my service is doing
>> though!
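The "run out of local ports" message typically points at ephemeral-port (outbound socket) exhaustion. Even without netstat or ss, the socket count and the ephemeral range can be read from /proc (a sketch assuming only a Linux /proc):

```shell
# Sketch: inspect TCP socket usage without netstat/ss.
# Each data line of /proc/net/tcp is one socket; the first line is a header.
echo "ephemeral port range: $(cat /proc/sys/net/ipv4/ip_local_port_range)"
echo "open TCP sockets:     $(tail -n +2 /proc/net/tcp | wc -l)"
```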
>>
>> Help would be greatly appreciated, thanks!
>>
>> Arthur Smith
>> _______________________________________________
>> Wikimedia Cloud Services mailing list
>> Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
>> https://lists.wikimedia.org/mailman/listinfo/cloud
>>
>>
>