[Labs-l] Debugging lighttpd OOM's

Maximilian Doerr maximilian.doerr at gmail.com
Tue Jun 10 11:58:49 UTC 2014


Go with hedonil's scripts.  They're very good.
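
As for checking after the fact whether a job died near its memory ceiling, the qacct output quoted below can be parsed with a few lines of Python. This is only a rough sketch, not part of hedonil's scripts: the function names are made up for illustration, the 4G limit and the 5% margin are assumptions (the thread does not state the actual limit for the webgrid queue), and size suffixes are treated as binary multiples.

```python
import re

# Multipliers for size suffixes; treating them as binary units is an assumption.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_qacct(text):
    """Turn 'key   value' lines from `qacct -j` output into a dict."""
    fields = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        # Lines without a value (blank lines, '(...)' elisions) are skipped.
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def size_to_bytes(value):
    """Convert a size such as '3.973G' to a number of bytes."""
    match = re.fullmatch(r"([0-9.]+)([KMGT]?)", value.strip().upper())
    if match is None:
        raise ValueError("unrecognised size: %r" % value)
    return float(match.group(1)) * _UNITS[match.group(2)]


def near_limit(fields, limit="4G", margin=0.05):
    """True if maxvmem came within `margin` of `limit` (both hypothetical defaults)."""
    used = size_to_bytes(fields["maxvmem"])
    return used >= (1 - margin) * size_to_bytes(limit)
```

Feeding it the text printed by `qacct -j 487745` and calling `near_limit(parse_qacct(text))` would flag the job below, since 3.973G is within 5% of the assumed 4G ceiling. It only tells you that memory was the likely cause, not which request made the FastCGI process grow.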

Sent from Maximilian's iPhone.

> On Jun 10, 2014, at 06:36, Magnus Manske <magnusmanske at googlemail.com> wrote:
> 
> As the maintainer of several dozen tools, I see this happen on a regular basis. There is no automatic notification and no automatic restart. Pitiful, really.
> 
> Hedonil has written a set of scripts to run the webservice in a more reliable manner, and even has an "auto-restarter", which I use for some of the tools where the standard webservice used to die on an almost daily basis.
> 
> Tools Labs should really improve this.
> 
> 
>> On Tue, Jun 10, 2014 at 10:28 AM, Merlijn van Deen <valhallasw at arctus.nl> wrote:
>> Hello all,
>> 
>> My 'tsreports' webservice randomly dies every now and then. qacct suggests this is due to OOM:
>> 
>> tools.tsreports at tools-login:~$ qacct -j 487745
>> qname        webgrid-lighttpd
>> (...)
>> jobname      lighttpd-tsreports
>> jobnumber    487745
>> (...)
>> qsub_time    Wed Apr 23 08:18:12 2014
>> start_time   Fri May 23 14:30:17 2014
>> end_time     Fri Jun  6 10:51:21 2014
>> (...)
>> failed       0
>> exit_status  0
>> (...)
>> maxvmem      3.973G
>> 
>> 
>> I have no clue how to debug this, though; the lighttpd error log just shows
>> 
>> 2014-06-06 10:51:20: (mod_fastcgi.c.3061) got proc: pid: 12119 socket: unix:/tmp/tsreports-index.fcgi.sock-0 load: 1
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:20: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:20: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:20: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:21: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:21: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 
>> which is not very informative, to say the least.
>> 
>> So: how can one debug these issues?
>> 
>> To add insult to injury, SGE doesn't even send an e-mail to tell me it killed the webserver, nor does it restart the webserver. Either of those would be reasonable (especially restarting the webserver). Instead, I had to be notified by someone on my talk page...
>> 
>> Merlijn
>> 
>> _______________________________________________
>> Labs-l mailing list
>> Labs-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>> 