We're currently experiencing a mysterious hardware failure in our datacenter -- three different SSDs failed overnight, two of them in cloudvirt1018 and one in cloudvirt1024. The VMs on 1018 are down entirely. We may move those on 1024 to another host shortly to guard against additional drive failures.
There's some possibility that we will experience permanent data loss on cloudvirt1018, but everyone is working hard to avoid this.
The following VMs are on cloudvirt1018:
a11y | reading-web-staging
abogott-scapserver | testlabs
af-puppetdb01 | automation-framework
api | openocr
asdf | quotatest
bastion-eqiad1-02 | bastion
clm-test-01 | community-labs-monitoring
compiler1002 | puppet-diffs
cyberbot-exec-iabot-01 | cyberbot
deployment-db03 | deployment-prep
deployment-db04 | deployment-prep
deployment-memc05 | deployment-prep
deployment-pdfrender02 | deployment-prep
deployment-sca01 | deployment-prep
design-lsg3 | design
eventmetrics-dev01 | eventmetrics
fridolin | catgraph
gtirloni-puppetmaster-01 | testlabs
hadoop-master-3 | analytics
ign | ign2commons
integration-castor03 | integration
integration-slave-docker-1017 | integration
integration-slave-docker-1033 | integration
integration-slave-docker-1038 | integration
integration-slave-jessie-1003 | integration
integration-slave-jessie-android | integration
k8s-master-01 | general-k8s
k8s-node-03 | general-k8s
k8s-node-05 | general-k8s
k8s-node-06 | general-k8s
kdc | analytics
labstash-jessie1 | logging
language-mleb-legacy | language
login-test | catgraph
lsg-01 | design
mathosphere | math
mc-clusterA-1 | test-twemproxy
mwoffliner5 | mwoffliner
novaadminmadethis-4 | quotatest
ntp-01 | cloudinfra
ntp-02 | cloudinfra
ogvjs-testing | ogvjs-integration
phragile-pro | phragile
planet-hotdog | planet
pub2 | wikiapiary
puppenmeister | planet
puppet-compiler-v4-other | testlabs
puppet-compiler-v4-tools | testlabs
quarry-beta-01 | quarry
signwriting-swis | signwriting
signwriting-swserver | signwriting
social-tools3 | social-tools
striker-deploy04 | striker
striker-puppet01 | striker
t166878 | otrs
togetherjs | visualeditor
tools-sgebastion-06 | tools
tools-sgeexec-0902 | tools
tools-sgeexec-0903 | tools
tools-sgewebgrid-generic-0901 | tools
tools-sgewebgrid-lighttpd-0901 | tools
ve-font | design
wikibase1 | sciencesource
wikicitevis-prod | wikicitevis
wikifarm | pluggableauth
women-in-red | globaleducation
We don't fully understand what happened, but after Giovanni performed the classic "turning it off and on again", things are now running without warnings. The VMs listed below are now coming back online and everything should be back up shortly.
We'll probably replace some of this hardware anyway, out of an abundance of caution, but that's unlikely to produce further downtime. With luck, this is the last you'll hear about this.
-Andrew
I spoke too soon -- we're still working on this. Most of these VMs will remain down in the meantime.
Sorry for the outage!
Here's the latest:
cloudvirt1018 is up and running, and many of its VMs are fine. Many other VMs are corrupted and won't start up. Some of those VMs will probably be lost for good, but we're still investigating rescue options.
In the meantime, if your VM is up and you can access it then you're in luck! If not, stay tuned.
-Andrew
Now cloudvirt1024 is dying in earnest, so VMs hosted there will be down for a while as well. This is, as far as anyone can tell, just a stupid coincidence.
So far it appears that we are going to be able to rescue /most/ things without significant data loss. For now, though, there's going to be plenty more downtime.
VMs on cloudvirt1024 are:
| 8113d2c5-6788-43f6-beeb-123b0b717af3 | drmf-beta | math |
| 169b3260-4f7e-43dc-94c2-e699308a3426 | ecmabot | webperf |
| 29e875e3-15d5-4f74-9716-c0025c2ea098 | encoding02 | video |
| 1b2b8b50-d463-4b7f-a3a9-6363eeb3ca8b | encoding03 | video |
| 5421f938-7a11-499c-bc6a-534da1f4e27d | hafnium | rcm |
| 041d42b9-df36-4176-9f5d-a508989bbebc | hound-app-01 | hound |
| 6149375b-8a08-4f03-882a-6fc0f5f77499 | integration-slave-docker-1044 | integration |
| 4d64b032-d93a-4a8c-a7e5-569c17e5063f | integration-slave-docker-1046 | integration |
| ad48959a-9eb9-46a9-bec4-a2bf23cdf655 | integration-slave-docker-1047 | integration |
| 21644632-0972-448f-83d0-b76f9d1d28e0 | ldfclient-new | wikidata-query |
| c2a30fe0-2c87-4b01-be53-8e2a3d0f40a7 | math-docker | math |
| df8f17fb-03fe-4725-b9cf-3d9fe76f4654 | mediawiki2latex | collection-alt-renderer |
| d73f36e6-7534-4910-9a6e-64a6b9088d1e | neon | rcm |
| 2d035965-ba53-41b3-b6ef-d2ebbe50656a | novaadminmadethis | quotatest |
| c84f61c0-4fd2-47a5-b6ab-dd6b5ea98d41 | ores-puppetmaster-01 | ores |
| 585bb328-8078-4437-b076-9e555683e27d | ores-sentinel-01 | ores |
| 0538bfed-d7b5-4751-9431-8feecbaf78c0 | oxygen | rcm |
| e8090d9e-7529-46a9-b1e1-c4ba523a2898 | packaging | thumbor |
| c7fe4663-7f2b-4d23-a79b-1a2e01c80d93 | twlight-prod | twl |
| 2370b38f-7a65-4ccf-a635-7a2fa5e12b3e | twlight-staging | twl |
| 464577c6-86f0-42f9-9c49-86f9ec9a0210 | twlight-tracker | twl |
| 5325322d-a57e-4a9b-85b7-37643f03bfea | wikidata-misc | wikidata-dev |
I ask this because of these failures. Where does cyberbot-db-01 live? The data on there is critical.
Cyberpower678
English Wikipedia Account Creation Team
English Wikipedia Administrator
Global User Renamer
On Wed, Feb 13, 2019 at 2:45 PM Maximilian Doerr maximilian.doerr@gmail.com wrote:
I ask this because of these failures. Where does cyberbot-db-01 live?
Per https://tools.wmflabs.org/openstack-browser/project/cyberbot it is on cloudvirt1023.eqiad.wmnet
The data on there is critical.
As you probably know, we do not currently have a trusted backup solution for Cloud VPS projects. Our best recommendation for 'critical' data is for you to set up some manual or automated backup to an offsite location (your laptop, a VPS hosted outside Cloud VPS, etc.). Hopefully we will have some news on an actual reliable backup service in the coming months. We have some hardware to build an initial system for this, but have not yet had time to design and implement the backup service itself.
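For illustration only, here is a minimal sketch of the kind of offsite pull backup meant here, assuming the machine that stores the backup can reach your instance over SSH (for Cloud VPS that normally means hopping through a bastion), and treating the instance name, bastion host, and paths below as placeholders rather than real infrastructure:

    # crontab entry on the machine that keeps the backup copy -- pulls the data every night at 03:00
    0 3 * * * rsync -az --delete -e "ssh -J bastion.wmflabs.org" your-instance.your-project.eqiad.wmflabs:/srv/data/ /backups/your-instance/

Anything along these lines (rsync, a database dump piped over ssh, and so on) is better than keeping the only copy on the hypervisor's local disk.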
Bryan
One consequence of this outage is that the server behind the Toolforge Stretch bastion (login-stretch.tools.wmflabs.org) has changed. If you are seeing a scary warning like this:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:8fLy4F9XDYdR/uHihWoPihKDhPaxCh0au/paSdGB7K8.
Please contact your system administrator.
Add correct host key in *HOME*/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in *HOME*/.ssh/known_hosts:*LINE*
ECDSA host key for login-stretch.tools.wmflabs.org has changed and you have requested strict checking.
Host key verification failed.
then you will need to update your known_hosts file. It probably contains a line like this:
login-stretch.tools.wmflabs.org,185.15.56.48 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEEMihgdO9CXKJvpoO4LMOt1cU43zIQJiXOm1doVMh0z+uXntQkNDyFeHJ9//T983eL8efbCBEgnB9POGfYfoas=
You can either change this to
login-stretch.tools.wmflabs.org,185.15.56.48 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBFnJSjCGW7kli+cdgtmndPAl4xLZNc9uqP9KWlsnVDqr8yQ2RkR5ACbXe6XZ+dS09Wc9ulOmGTOwCImMi9Fho78=
or remove the line and then look for the following output the next time you SSH into the bastion:
The authenticity of host 'login-stretch.tools.wmflabs.org (185.15.56.48)' can't be established.
ECDSA key fingerprint is SHA256:8fLy4F9XDYdR/uHihWoPihKDhPaxCh0au/paSdGB7K8.
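If you would rather not edit the file by hand, the standard OpenSSH tooling does the same thing; this is just the generic recipe, nothing specific to this setup:

    ssh-keygen -R login-stretch.tools.wmflabs.org   # drop the old entry (a backup is saved as known_hosts.old)
    ssh login-stretch.tools.wmflabs.org             # reconnect, and compare the offered fingerprint with the one above before typing 'yes'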
Good luck! Cheers, Lucas
(Or you can check the fingerprints page on Wikitech https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/login-stretch.tools.wmflabs.org, which has now been updated, instead of trusting me. It also has the fingerprints in additional formats.)
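If you want to double-check locally rather than taking either source on faith, something like the following should work with stock OpenSSH (exact output format depends on your client version):

    ssh-keyscan -t ecdsa login-stretch.tools.wmflabs.org 2>/dev/null > /tmp/bastion-key.pub
    ssh-keygen -l -f /tmp/bastion-key.pub           # SHA256 fingerprint, should match the one quoted above
    ssh-keygen -l -E md5 -f /tmp/bastion-key.pub    # the same key as an MD5 fingerprint, for older clients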
I have evacuated almost all the VMs on cloudvirt1024 and cloudvirt1018. In theory these are all now up and running on different hardware.
The list of affected VMs can be found earlier in this thread. Instances previously hosted on cloudvirt1024 should be up and running and largely unaffected by the move.
Nearly every instance that was on cloudvirt1018 has suffered some degree of disk corruption. For the most part I've repaired them enough to allow logins, but I recommend that you check them extensively before relying on data integrity there. In some cases you may find part or all of your misplaced files in /lost+found.
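If you want a quick sanity check after logging in, a few generic commands can help spot lingering damage (nothing here is Cloud-VPS-specific, and debsums may not be installed on your image):

    sudo dmesg | grep -iE 'ext4|i/o error'   # look for filesystem errors logged since the last boot
    sudo ls -la /lost+found                  # files reattached by fsck land here under numeric names
    sudo debsums -s                          # Debian/Ubuntu only: list packaged files whose checksums no longer match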
Two VMs are still mid-copy, due to being enormous... I'm going to leave them to continue the evacuation overnight. They are 'mwoffliner5.mwoffliner.eqiad.wmflabs' and 'pub2.wikiapiary.eqiad.wmflabs'. They may come up during the night but I recommend against logging into them or restarting them until I've had a chance to run disk repair on them in the morning.
If you have specific issues with VMs from cloudvirt1018, feel free to seek help or advice in #wikimedia-cloud on IRC -- I expect Arturo will appear there in a few hours.
-Andrew