This incident is now over and services should be working as normal, although file access may be a bit slow while Ceph rebalances and recovers.
The original cause seems to have been a bad optical cable in the datacenter. We're preparing an incident doc and I'll send that along in a follow-up email.
-Andrew + wmcs team
On 6/11/24 10:15 AM, Andrew Bogott wrote:
There is an as-yet-undiagnosed issue with our storage system (Ceph) that is causing serious failures throughout cloud-vps and toolforge.
Multiple people are working on the issue, so watch this list for updates. Sorry for the downtime!
-Andrew + wmcs team