Webservice failures (Toolforge) - Cloud

26 Jan 2020

TL;DR: The lighttpd webservice for https://tools.wmflabs.org/dplbot/ fails repeatedly,
frequently, and unpredictably, and I have been unable to diagnose any cause.

Currently, tools.dplbot is running a php7.2 webservice on the kubernetes backend; however,
the failures started occurring when it was running lighttpd on the job grid, and the move
to kubernetes does not seem to have changed anything in this respect. The tool serves a
variety of PHP-based pages which generate reports from the Toolforge database replicas.

The symptom of failure is that all requests get rejected with 503 Service Unavailable. The
lighttpd process continues to run (which is why I am calling this a "failure" rather than a
"crash"), so kubernetes doesn't detect any problem and doesn't restart the server, but the
server does not respond to any requests. The "webservice status" command claims that the
webservice is still running. Every time this happens, I have to restart the webservice.
After some restarts the webservice fails again immediately; after others it runs normally
for a highly variable period (minutes to hours) and then fails again.
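
Since "webservice status" keeps reporting the service as running, one way to pin down when
each failure starts is to poll the tool from outside and record every status change. The
following is only a minimal sketch of that idea, not something from the Toolforge
documentation: the function names (probe, detect_transitions, watch) and the 30-second
interval are my own assumptions, and it uses nothing beyond the Python standard library.

```python
import time
import urllib.error
import urllib.request

TOOL_URL = "https://tools.wmflabs.org/dplbot/"  # URL taken from the post


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no HTTP response arrived."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses (including 503) are raised as HTTPError.
        return exc.code
    except OSError:
        # URLError, timeouts and connection errors are all OSError subclasses.
        return None


def detect_transitions(samples):
    """Given (timestamp, status) pairs in time order, return (timestamp, old, new)
    for every sample whose status differs from the previous sample."""
    transitions = []
    previous = None
    for index, (stamp, status) in enumerate(samples):
        if index > 0 and status != previous:
            transitions.append((stamp, previous, status))
        previous = status
    return transitions


def watch(url=TOOL_URL, interval=30.0, rounds=None):
    """Poll url every `interval` seconds and print each status change.
    rounds=None polls forever; a number limits the poll count (hypothetical knob)."""
    previous = None
    count = 0
    while rounds is None or count < rounds:
        status = probe(url)
        if count > 0 and status != previous:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S %z")
            print(f"{stamp} status changed: {previous} -> {status}")
        previous = status
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval)
```

Calling watch() from a machine outside the pod would print a line at each 200 -> 503
transition, giving timestamps to compare against error.log and access.log. It only narrows
down when a failure begins; it says nothing about why.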

Even more frustrating than the constant failures is the lack of any information to allow
diagnosing the cause of this. The error.log file (/data/project/dplbot/error.log) does not
show any error messages corresponding to the times of failures. I tried various lighttpd
debugging options, and none of these gave me anything useful. They appear to show all
requests being handled normally, and no debug information at all at or after the point
of failure. I also reactivated access logging (/data/project/dplbot/access.log), and this
only shows requests that were handled correctly. In other words, there is no log
indicating a request that came in at/just before a failure without a corresponding
response going out.
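
Because the access log only records requests that completed, a failure window should show
up as a silence in it. A rough sketch for finding the longest silences follows. It assumes
each line carries a Common Log Format style timestamp in square brackets (the example in
the code comment is illustrative, not copied from my log), and the names STAMP, parse_times
and largest_gaps are hypothetical. A long gap may also just mean a quiet period with no
traffic, so this only suggests where to look.

```python
import re
from datetime import datetime

# Assumed timestamp layout, e.g. [26/Jan/2020:13:45:10 +0000] (illustrative value).
STAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def parse_times(lines):
    """Extract a timezone-aware datetime from every line that has a timestamp."""
    times = []
    for line in lines:
        match = STAMP.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z"))
    return times


def largest_gaps(lines, top=5):
    """Return up to `top` (gap_seconds, last_before, first_after) tuples,
    longest gap first, computed between consecutive log entries."""
    times = parse_times(lines)
    gaps = [
        ((later - earlier).total_seconds(), earlier, later)
        for earlier, later in zip(times, times[1:])
    ]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]
```

Passing the open file object for /data/project/dplbot/access.log to largest_gaps() would
list the last request served before each long silence and the first one served after it,
which are the requests worth inspecting around a failure.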

If these failures were being caused spontaneously by some problem in lighttpd or in the
Toolforge infrastructure, I would expect other users to be affected by them, but that
doesn't seem to be the case. 

This has previously been reported at https://phabricator.wikimedia.org/T115231 (including
more detail on the debug options I tried), where frankly I have received absolutely no
assistance. I did receive one mildly helpful comment from bd808 on a related issue
(https://phabricator.wikimedia.org/T218915), as follows:

  ... [It is] possible to have a Kubernetes powered webservice become unresponsive to
  client requests due to an internal deadlock or resource exhaustion issue in the
  application which does not also lead to a crash of the lighttpd process itself.

However, if there is an internal deadlock or resource exhaustion issue in the underlying
PHP scripts, I would expect some error message in the logs, which isn't there. Also,
during a recent interval when the server was up for a while, I took the time to click
every single link on https://tools.wmflabs.org/dplbot/, and the server responded to every
one of them, so there does not seem to be a fatal bug in any of the scripts (although this
exercise revealed a few minor issues).

I'm not necessarily looking for someone to solve this problem for me (although that
would be nice :-) ), but just some ideas about how to identify potential causes. Right now
it is basically a black hole; no information whatsoever is coming out of the webserver at
the point of failure, so I can make no progress.

-- 
 Russell Blau
 russblau(a)imapmail.org