[Labs-l] Debugging lighttpd OOM's
Maximilian Doerr
maximilian.doerr at gmail.com
Tue Jun 10 11:58:49 UTC 2014
Go with Hedonil's scripts. They're very good.
Sent from Maximilian's iPhone.
> On Jun 10, 2014, at 06:36, Magnus Manske <magnusmanske at googlemail.com> wrote:
>
> As the maintainer of several dozen tools, I see this happen on a regular basis. There is no automatic notification and no automatic restart. Pitiful, really.
>
> Hedonil has written a set of scripts to run the webservice in a more reliable manner, and even has an "auto-restarter", which I use for some of the tools where the standard webservice used to die on an almost daily basis.
>
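For anyone who wants to see the idea behind an auto-restarter before adopting a ready-made one, here is a minimal sketch. It is not Hedonil's script, and I have not checked how that one works; it only shows the general pattern of "probe the tool's URL, restart if it does not answer". The function names, the URL, and the restart command below are assumptions for illustration.

```python
import subprocess
import urllib.error
import urllib.request


def is_alive(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the web server answers with an HTTP status below 500."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # A 4xx reply still means the web server itself is up and answering.
        return err.code < 500
    except OSError:
        # Covers URLError, refused connections, timeouts and DNS failures.
        return False


def watchdog_once(url, restart_cmd, probe=is_alive, run=subprocess.run):
    """Probe the URL once and run restart_cmd if the probe fails.

    Returns True if a restart was issued, False if the service looked healthy.
    """
    if probe(url):
        return False
    # check=False: a failing restart command should not crash the watchdog.
    run(restart_cmd, check=False)
    return True


# Hypothetical usage (URL and command are assumptions, not taken from the thread):
# watchdog_once("http://tools.wmflabs.org/tsreports/", ["webservice", "restart"])
```

Run periodically (for example from a scheduled job), this would restart a dead webservice within one polling interval. It does nothing about the underlying memory growth, so it is a workaround rather than a fix.
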
> Tools Labs should really improve this.
>
>
>> On Tue, Jun 10, 2014 at 10:28 AM, Merlijn van Deen <valhallasw at arctus.nl> wrote:
>> Hello all,
>>
>> My 'tsreports' webservice randomly dies every now and then. qacct suggests this is due to OOM:
>>
>> tools.tsreports at tools-login:~$ qacct -j 487745
>> qname webgrid-lighttpd
>> (...)
>> jobname lighttpd-tsreports
>> jobnumber 487745
>> (...)
>> qsub_time Wed Apr 23 08:18:12 2014
>> start_time Fri May 23 14:30:17 2014
>> end_time Fri Jun 6 10:51:21 2014
>> (...)
>> failed 0
>> exit_status 0
>> (...)
>> maxvmem 3.973G
>>
>>
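Note that `failed` and `exit_status` are both 0 here, so the only hint in this output is `maxvmem`. A rough way to check that automatically is sketched below: it parses the "key value" lines that qacct prints and compares `maxvmem` with a memory limit you supply. The limit itself is not stated anywhere in this thread, so the `4G` in the usage comment is purely an assumption, as are the helper names and the 95% threshold. The sketch also assumes uppercase suffixes mean binary multiples.

```python
import re

# Assumption: uppercase K/M/G/T suffixes are binary multiples.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size such as '3.973G' into a number of bytes."""
    match = re.fullmatch(r"([0-9.]+)([KMGT]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit]


def parse_qacct(output):
    """Turn the 'key value' lines of qacct output into a dict.

    Lines without a value, such as the '(...)' elisions above, are skipped.
    """
    fields = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def near_limit(fields, limit, threshold=0.95):
    """Return True if maxvmem reached at least `threshold` of `limit`."""
    return parse_size(fields["maxvmem"]) >= threshold * parse_size(limit)


# Hypothetical usage, with qacct_output holding the text printed by
# `qacct -j 487745` and a 4G limit assumed:
# near_limit(parse_qacct(qacct_output), "4G")
```

This only tells you that the job was close to its limit when it ended. It does not show which process or request caused the growth.
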
>> I have no clue how to debug this, though; the lighttpd error log just shows
>>
>> 2014-06-06 10:51:20: (mod_fastcgi.c.3061) got proc: pid: 12119 socket: unix:/tmp/tsreports-index.fcgi.sock-0 load: 1
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:20: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:20: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:20: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:21: (server.c.1502) unlink failed for: /var/run/lighttpd/tsreports.pid 2 No such file or directory
>> 2014-06-06 10:51:21: (server.c.1512) server stopped by UID = 0 PID = 12087
>> 2014-06-06 10:51:20: (server.c.1512) server stopped by UID = 0 PID = 12087
>>
>> which is not very informative, to say the least.
>>
>> So: how can one debug these issues?
>>
>> To add insult to injury, SGE doesn't even send an e-mail to tell me it killed the webserver, nor does it restart the webserver. Either of those would be reasonable (especially restarting the webserver). As it was, I had to be notified by someone on my talk page...
>>
>> Merlijn
>>
>> _______________________________________________
>> Labs-l mailing list
>> Labs-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>>