On 2/14/18 6:58 AM, Chase Pettet wrote:
We lost a KVM host at around 7:20 UTC. Because we use local storage for instances, a number of them are down. Toolforge suffered a few losses, but they seem to have been few enough that GridEngine and Kubernetes users are unaffected at this time. The initial task is T187292 (with a list of the affected instances), and an incident report will follow. We hope to recover all of the instances that are down, but it will take time to sort through them.
This outage is still ongoing.
We're currently waiting on some on-site data center work (re-applying thermal paste to the host's CPUs) before deciding exactly how to respond. It still appears that no data has been lost, but the affected VMs will remain powered off for several more hours.
Here is a complete list of the affected VMs:
accounts-appserver4.account-creation-assistance
accounts-mwoauth.account-creation-assistance
bastion-02.bastion
bastion-restricted-02.bastion
bf-wmpageview.butterfly
chat-bots.mobile
ci-jessie-wikimedia-965167.contintcloud
ci-jessie-wikimedia-965171.contintcloud
ci-jessie-wikimedia-965176.contintcloud
ci-jessie-wikimedia-965182.contintcloud
ci-jessie-wikimedia-965183.contintcloud
ci-jessie-wikimedia-965184.contintcloud
ci-jessie-wikimedia-965185.contintcloud
client.nonfreewiki
commonsarchive-production.commonsarchive
cxserver2.language
dashboardchat.globaleducation
deployment-changeprop.deployment-prep
deployment-elastic05.deployment-prep
deployment-ircd.deployment-prep
deployment-mathoid.deployment-prep
deployment-sca02.deployment-prep
drmf2016.math
huggle-pg.huggle
incubator-web.incubator
integration-slave-jessie-1001.integration
integration-slave-jessie-1002.integration
k8s-bastion.chasetestproject
language-mleb-master.language
ldfclient.wikidata-query
math-ru.math
mwaas-k8-node-02.scrumbugz
mwoffliner1.mwoffliner
mwv-apt-01.mwv-apt
newsletter-test.newsletter
ores-lb-02.ores
ores-worker-04.ores
overpass-wiki.maps
puppetmaster-keith.puppet
reflex2.design
rel.search
stack.reading-web-staging
tools-docker-builder-05.tools
tools-exec-1413.tools
tools-exec-1442.tools
tools-webgrid-lighttpd-1427.tools
tools-webgrid-lighttpd-1428.tools
torproxy.security-tools
udpmx-01.ircd
video-redis.video
wikidataconcepts.wikidataconcepts
wikiedu-dashboard-staging.globaleducation
wikilabels-experiment.wikilabels
wikilabels-staging-01.wikilabels
wikimetrics-staging.wikimetrics
wikimetrics-test.wikimetrics
wmde-wikidiff2-patched.wikidiff2-wmde-dev
zk1-1.analytics
-- Chase Pettet, chasemp on Phabricator (https://phabricator.wikimedia.org/p/chasemp/) and IRC
Wikimedia Cloud Services announce mailing list Cloud-announce@lists.wikimedia.org (formerly labs-announce@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud-announce
The host in question has been repaired and restarted; all hosted VMs should now be up and running.
We're not 100% certain that we've addressed the root cause of the problem, so we'll be watching to see whether the host fails again. In the meantime, though, everything should be back to normal.
Sorry for the downtime!
-Andrew + the WMCS team