[Labs-l] SGE issues again

Merlijn van Deen valhallasw at arctus.nl
Tue Jan 12 20:30:58 UTC 2016


As promised, the post-mortem.

tl;dr: the corruption issue we had in December is still there, and bites us
every now and then. We're not entirely sure what is causing the corruption,
but we suspect NFS, and are working to move the database to a local
filesystem.

Long story:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20160112-20160111-toollabs-SGE

Again, sorry for the disruptions. Unfortunately we cannot guarantee there
will not be more of these outages in the near future.

Merlijn

On 11 January 2016 at 23:15, Merlijn van Deen <valhallasw at arctus.nl> wrote:

> Somehow sending an e-mail to labs-l seems to resolve issues magically. The
> issue started around 21:00 UTC, and I'll write up a post-mortem tomorrow.
>
> On 11 January 2016 at 23:10, Merlijn van Deen <valhallasw at arctus.nl>
> wrote:
>
>> Jobs are being queued, but are not executing. Every now and then a few
>> jobs /are/ executed, but the backlog is ~20 minutes. We're not quite sure
>> what's happening, unfortunately, but we're working on it.
>>
>
>