Current disruptions for VMs/toolforge, disk IO - Cloud-announce

11 Aug 2022

Hi everyone!

TL;DR:
There is currently a service degradation affecting VMs and anything running
on them (e.g. Toolforge, Quarry, PAWS, ...). You might be able to use the
services, or they might be too slow to use. We are working on it and will
send an update once it is fixed.

Long story:

We were adding a new Ceph node to the Ceph cluster. This time the node was in
a different subnet, but Ceph is supposed to work transparently across multiple
subnets. For some reason the new node was added to the cluster, but it is not
replying to heartbeats sent by the other nodes in the cluster. That causes the
cluster to keep rebalancing data, which creates continuous IO slowness for all
clients (such as VMs).

We are trying to minimize the impact by limiting the amount of data that gets
reshuffled. That slows down the intervention a bit, but should improve the
client experience.

We are actively working on this, and will update with any changes.

Cheers!

-- 
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."