[Labs-admin] tools: main page (and possibly other tools) just died, then (after about half an hour) restarted itself (?)

Alex Monk amonk at wikimedia.org
Sun Nov 20 07:32:02 UTC 2016


[06:33:00] <icinga-wm> PROBLEM - tools homepage -admin tool- on
tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not
Available - 531 bytes in 0.021 second response time
[06:34:03] <shinken-wm> PROBLEM - ToolLabs Home Page on toollabs is
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - string
'Magnus' not found on 'http://tools.wmflabs.org:80/' - 531 bytes in 0.031
second response time
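
For reference, the two alerts above boil down to a status check plus a string check. The following is only a rough sketch of that logic, not the actual icinga/shinken plugin: the function names `classify_response` and `probe` are hypothetical, and the assumption that anything other than HTTP 200 counts as critical is mine.

```python
import urllib.error
import urllib.request


def classify_response(status_code, body, expected_string="Magnus"):
    """Return (state, reason) for one homepage probe.

    Assumption: any status other than 200 is CRITICAL, and a 200 page
    that lacks the expected string is CRITICAL as well.
    """
    if status_code != 200:
        return ("CRITICAL", f"HTTP {status_code}")
    if expected_string not in body:
        return ("CRITICAL", f"string '{expected_string}' not found")
    return ("OK", f"HTTP {status_code}")


def probe(url="http://tools.wmflabs.org:80/", timeout=10):
    """Fetch the URL and classify the result (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; keep status and body.
        status = err.code
        body = err.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as err:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return ("CRITICAL", f"connection failed: {err.reason}")
    return classify_response(status, body)
```

Calling `probe()` by hand would give a quick second opinion on whether the main page is really serving a 503 or the monitoring is confused.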

I started looking into this:
* Checked a couple of tools; other things, e.g. GUC, appeared to be up (so I
didn't SMS any ops, as I'm not sure the main page is that important)
* Found that it runs on the grid and tried `qmod -rj lighttpd-admin`
* It appeared up after this, but only briefly, then it was gone again
* Tried to figure out how to start it
* Attempted 'webservice start', which looked OK, but 'webservice status'
would always say 'Your webservice is not running'
* ~07:13:24 - it mysteriously came back online
* 07:16:52 - Matthew Bowker informed me that xTools was down too (no
monitoring from shinken or icinga alerted IRC about this, but it is possibly
connected). He said the error from 'webservice restart' was
https://www.irccloud.com/pastebin/w6AfLja7/
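
As a quick sanity check on the timeline, the gap between the first icinga alert and the approximate recovery time can be computed as below. This is a minimal sketch that assumes both timestamps come from the same clock on the same day; the helper name `elapsed` is made up for illustration, and the recovery time is only approximate.

```python
from datetime import datetime


def elapsed(start, end, fmt="%H:%M:%S"):
    """Return the timedelta between two same-day clock times."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


first_alert = "06:33:00"   # first icinga-wm PROBLEM line
back_online = "07:13:24"   # approximate time the page came back

print(elapsed(first_alert, back_online))  # prints 0:40:24
```

So the main page was unavailable for roughly 40 minutes by these timestamps.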

While this was happening I was watching
/data/project/.system/gridengine/spool/qmaster/messages and saw quite a few
'host "tools-cron-01.tools.eqiad.wmflabs" is no admin host' errors in there,
though I have no reason to believe they are connected.
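
To put a number on "quite a few", something like the following could tally those errors per host. It is a sketch only: it assumes the quoted error text appears verbatim within each log line, makes no assumption about the rest of the line format, and `count_no_admin_host` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the error text quoted above and captures the host name.
NO_ADMIN_HOST = re.compile(r'host "([^"]+)" is no admin host')


def count_no_admin_host(lines):
    """Count 'is no admin host' errors per host in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = NO_ADMIN_HOST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example use against the qmaster log mentioned above:
#
#   path = "/data/project/.system/gridengine/spool/qmaster/messages"
#   with open(path, errors="replace") as log:
#       print(count_no_admin_host(log).most_common())
```

Comparing the counts before and during the outage window would show whether these errors are just background noise.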