Would you mind giving me a ping when it fails? I will see if I can find
anything with ptrace (strace / gdb). It might be slightly easier to debug
if it's running on the grid so I don't have to mess with Linux namespaces.
YiFei Zhu
On Sun, Jan 26, 2020 at 9:57 AM Russell Blau <russblau(a)imapmail.org> wrote:
TL;DR: The lighttpd webservice for
https://tools.wmflabs.org/dplbot/
fails repeatedly, frequently, and unpredictably, and I have been unable to
diagnose any cause.
Currently, tools.dplbot is running a php7.2 webservice on the kubernetes
backend; however, the failures started occurring when it was running
lighttpd on the job grid, and the move to kubernetes does not seem to have
changed anything in this respect. The tool serves a variety of PHP-based
pages which generate reports from the Toolforge database replicas.
The symptom of failure is that all requests get rejected with 503 service
unavailable. The lighttpd process continues to run (which is why I am
calling this a "failure" rather than a "crash"), which means kubernetes
doesn't detect any problem and doesn't restart the server, but the server
does not respond to any requests. The "webservice status" command claims
that the webservice is still running. Every time this happens, I have to
restart the webservice. The webservice appears to fail immediately after
some restarts, while in other cases it runs normally for a highly variable
period of time (minutes to hours) and then fails again.
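Since the process stays up and "webservice status" reports nothing wrong,
one way to pin down the moment of each failure is to poll the tool from
outside and record when the response flips to 503. What follows is only a
minimal sketch under stated assumptions, not something the tool currently
runs: the function names (probe, is_failed, watch), the 60-second interval,
and the choice to treat any 5xx status or a missing response as a failure
are all illustrative.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code (e.g. 503) is on the error.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def is_failed(status):
    """Assumption: no response, or any 5xx status, counts as a failure."""
    return status is None or status >= 500


def watch(url, interval=60.0, fetch=probe, clock=time.time,
          sleep=time.sleep, max_checks=None):
    """Poll url and yield (timestamp, status, failed) on each state change.

    fetch, clock and sleep are injectable so the loop can be exercised
    without touching the network. max_checks=None polls forever.
    """
    was_failed = False
    checks = 0
    while max_checks is None or checks < max_checks:
        status = fetch(url)
        failed = is_failed(status)
        if failed != was_failed:
            yield (clock(), status, failed)
            was_failed = failed
        checks += 1
        sleep(interval)
```

Iterating over watch("https://tools.wmflabs.org/dplbot/") and printing each
tuple would give a list of failure and recovery times, accurate to within
one polling interval, that can be lined up against error.log and access.log.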
Even more frustrating than the constant failures is the lack of any
information to allow diagnosing the cause of this. The error.log file
(/data/project/dplbot/error.log) does not show any error messages
corresponding to the times of failures. I tried various lighttpd debugging
options, and none of these gave me anything useful. They appear to show all
requests being handled normally, and no debug information at all at or
after the point of failure. I also reactivated access logging
(/data/project/dplbot/access.log), and this only shows requests that were
handled correctly. In other words, there is no log indicating a request
that came in at/just before a failure without a corresponding response
going out.
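Because the access log only ever shows successful requests, the most it
can offer is the time at which logging stopped. A rough sketch of how to
look for such silent stretches is below. It assumes the log lines carry a
bracketed timestamp in the common log style, e.g.
[26/Jan/2020:14:57:01 +0000]; if the accesslog format has been customized,
the pattern would need adjusting. The names TIMESTAMP_RE, parse_times and
find_gaps, and the five-minute threshold, are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp layout: [day/Mon/year:HH:MM:SS +zone]
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")
TIMESTAMP_FMT = "%d/%b/%Y:%H:%M:%S %z"


def parse_times(lines):
    """Yield a timezone-aware datetime for each line that has a timestamp."""
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if match:
            yield datetime.strptime(match.group(1), TIMESTAMP_FMT)


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs separated by more than threshold.

    Each pair brackets a period in which nothing was logged, which is
    where a failure (or simply a quiet spell) may have occurred.
    """
    gaps = []
    previous = None
    for current in parse_times(lines):
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Running find_gaps over the lines of /data/project/dplbot/access.log would
list the silent intervals; a gap that ends at a known restart time points
to the last request served before the failure. A gap by itself does not
prove a failure, since it may just mean no traffic arrived.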
If these failures were being caused spontaneously by some problem in
lighttpd or in the Toolforge infrastructure, I would expect other users to
be affected by them, but that doesn't seem to be the case.
This has previously been reported at
https://phabricator.wikimedia.org/T115231 (including more detail on the
debug options I tried), where frankly I have received absolutely no
assistance. I did receive one mildly helpful comment from bd808 on a
related issue (https://phabricator.wikimedia.org/T218915), as follows:
... [It is] possible to have a Kubernetes powered webservice become
unresponsive to client requests due to an internal deadlock or resource
exhaustion issue in the application which does not also lead to a crash of
the lighttpd process itself.
However, if there is an internal deadlock or resource exhaustion issue in
the underlying PHP scripts, I would expect some error message in the logs,
which isn't there. Also, during a recent interval when the server was up
for a while, I took the time to click every single link on
https://tools.wmflabs.org/dplbot/, and the server responded to every one
of them, so there does not seem to be a fatal bug in any of the scripts
(although this exercise revealed a few minor issues).
I'm not necessarily looking for someone to solve this problem for me
(although that would be nice :-) ), but just some ideas about how to
identify potential causes. Right now it is basically a black hole; no
information whatsoever is coming out of the webserver at the point of
failure, so I can make no progress.
--
Russell Blau
russblau(a)imapmail.org
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud