[Labs-l] Tool Labs SGE outage

Yuvi Panda yuvipanda at gmail.com
Thu May 28 16:53:02 UTC 2015


The root cause has been found, and everything has been working again
for the last few hours. The outage report is at
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150527-GridEngine

Thanks.

On Thu, May 28, 2015 at 1:59 PM, Russell Blau <russblau at imapmail.org> wrote:
> Yuvi Panda <yuvipanda at gmail.com> writes:
>
>>
>> It's been back and working mostly well for a while now. According to
>> alerts the partial outage was from 18:33 UTC to 20:17 UTC. More
>> details to follow later, here and at
>> https://phabricator.wikimedia.org/T100554
>
> This seems not to be entirely fixed. All night, I have been getting
> intermittent errors on cron jobs with the following message:
>
> error: commlib error: access denied (server host resolves rdata host
> "tools-submit.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)")
>
> Curiously, not all grid jobs fail in this way; some of them have been
> running successfully, but without any apparent pattern.
>
>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l



-- 
Yuvi Panda T
http://yuvi.in/blog


