[Labs-l] Debugging lighttpd OOM's

Magnus Manske magnusmanske at googlemail.com
Tue Jun 10 10:36:19 UTC 2014


As the maintainer of several dozen tools, I see this happen on a regular
basis. There is no automatic notification and no automatic restart. Pitiful,
really.

Hedonil has written a set of scripts that run the webservice more reliably,
including an "auto-restarter". I use it for some of the tools where the
standard webservice used to die on an almost daily basis.
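
For anyone who wants to roll their own in the meantime, here is a rough
sketch of what such a restarter could look like. To be clear, this is not
Hedonil's script, just an illustration. It assumes that "qstat -xml" lists
each job in a <job_list> element with a <JB_name> child, and that
"webservice start" brings the lighttpd job back up; check both on your own
account before relying on it. The function names are made up.

```python
import subprocess
import xml.etree.ElementTree as ET


def listed_job_names(qstat_xml):
    """Return the names of all jobs found in `qstat -xml` output.

    Assumption: each job is a <job_list> element with a <JB_name> child.
    Queued and running jobs are both counted, so a job that is merely
    waiting in the queue is not started a second time.
    """
    root = ET.fromstring(qstat_xml)
    names = set()
    for job in root.iter("job_list"):
        name = job.findtext("JB_name")
        if name:
            names.add(name)
    return names


def ensure_webservice(job_name, run=subprocess.run):
    """Start the webservice if `job_name` is not known to the grid.

    Returns True if a restart was attempted, False if the job was listed.
    `run` is injectable so the logic can be exercised without a grid.
    """
    result = run(["qstat", "-xml"], capture_output=True, text=True, check=True)
    if job_name in listed_job_names(result.stdout):
        return False
    # Assumption: `webservice start` is the right restart command here.
    run(["webservice", "start"], check=True)
    return True
```

Calling ensure_webservice("lighttpd-tsreports") from a cron job every few
minutes would be one way to use it. It does not tell you why the job died,
only that it is gone.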

Tools Labs should really improve this.
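
On the qacct output quoted below: the maxvmem of 3.973G is the main hint.
If the memory limit for the job is 4G (that is an assumption on my part,
check the actual limit of your job), peak usage was within 1% of it, which
would be consistent with an OOM kill. A small sketch to run that check over
saved qacct output; the 95% threshold and the function names are arbitrary
choices for illustration.

```python
import re

# Binary multipliers for the size suffixes seen in qacct output.
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_qacct(text):
    """Parse `qacct -j` output into a dict of field name -> value string.

    Lines without a value (separators, elided "(...)" lines) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def size_to_bytes(value):
    """Convert a size such as '3.973G' to a number of bytes."""
    match = re.fullmatch(r"([0-9.]+)\s*([KMGT]?)B?", value.strip().upper())
    if match is None:
        raise ValueError("unrecognised size: %r" % value)
    return float(match.group(1)) * UNITS[match.group(2)]


def near_limit(fields, limit_bytes, threshold=0.95):
    """Return True if maxvmem reached `threshold` of the memory limit.

    `limit_bytes` must be supplied by the caller; qacct output as quoted
    in this thread does not include the limit itself.
    """
    return size_to_bytes(fields["maxvmem"]) >= threshold * limit_bytes
```

This only says that the process grew close to the limit. It does not say
what was using the memory, which is the part that still needs debugging.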


On Tue, Jun 10, 2014 at 10:28 AM, Merlijn van Deen <valhallasw at arctus.nl>
wrote:

> Hello all,
>
> My 'tsreports' webservice randomly dies every now and then. qacct suggests
> this is due to OOM:
>
> tools.tsreports@tools-login:~$ qacct -j 487745
> qname        webgrid-lighttpd
> (...)
> jobname      lighttpd-tsreports
> jobnumber    487745
> (...)
> qsub_time    Wed Apr 23 08:18:12 2014
> start_time   Fri May 23 14:30:17 2014
> end_time     Fri Jun  6 10:51:21 2014
> (...)
> failed       0
> exit_status  0
> (...)
> maxvmem      3.973G
>
>
> I have no clue how to debug this, though; the lighttpd error log just shows
>
> 2014-06-06 10:51:20: (mod_fastcgi.c.3061) got proc: pid: 12119 socket:
> unix:/tmp/tsreports-index.fcgi.sock-0 load: 1
> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
> 2014-06-06 10:51:20: (server.c.1502) unlink failed for:
> /var/run/lighttpd/tsreports.pid 2 No such file or directory
> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
> 2014-06-06 10:51:20: (server.c.1502) unlink failed for:
> /var/run/lighttpd/tsreports.pid 2 No such file or directory
> 2014-06-06 10:51:20: (server.c.1502) unlink failed for:
> /var/run/lighttpd/tsreports.pid 2 No such file or directory
> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
> 2014-06-06 10:51:21: (server.c.1502) unlink failed for:
> /var/run/lighttpd/tsreports.pid 2 No such file or directory
> 2014-06-06 10:51:21: (server.c.1512) server stopped by UID = 0 PID = 12087
> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>
> which is not very informative, to say the least.
>
> So: how can one debug these issues?
>
> To add insult to injury, SGE doesn't even send an e-mail to tell me it
> killed the webserver, nor does it restart the webserver. Either of those
> would be reasonable (especially restarting the webserver). As it was, I
> had to be notified by someone on my talk page...
>
> Merlijn
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l