Hi everyone!
TL;DR: The service for VMs and anything running on them (e.g. Toolforge, Quarry, PAWS, ...) is currently degraded. You might still be able to use the services, or they might be too slow to use. We are working on it and will send an update when it is fixed.
Long story:
We were adding a new node to the Ceph cluster. This time the node was in a different subnet, but Ceph is supposed to work transparently across multiple subnets. For some reason the new node was added to the cluster but is not replying to the heartbeats sent by the other nodes. That causes the cluster to keep rebalancing data, which creates continuous IO slowness for all clients (such as VMs).
We are trying to minimize the impact by limiting the amount of data that gets re-shuffled. That slows down the intervention a bit, but should improve the client experience.
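For readers curious what limiting re-shuffled data can look like in practice, here is a minimal sketch. It is not necessarily what was done in this incident. It assumes the standard `ceph` CLI with the `norebalance` OSD flag and the `osd_max_backfills` / `osd_recovery_max_active` options; whether those options take effect depends on the Ceph version and scheduler. The helper name `throttle_commands` is hypothetical, and the code only builds the command lines without running them.

```python
from typing import List


def throttle_commands(pause_rebalance: bool = True,
                      max_backfills: int = 1,
                      max_recovery_active: int = 1) -> List[List[str]]:
    """Build (but do not run) ceph CLI invocations that limit data movement.

    Hypothetical helper for illustration; the limits chosen here are
    assumptions, not the values used by the operators.
    """
    if max_backfills < 1 or max_recovery_active < 1:
        raise ValueError("limits must be at least 1")

    commands: List[List[str]] = []
    if pause_rebalance:
        # Cluster-wide flag: pause data movement done purely for rebalancing.
        commands.append(["ceph", "osd", "set", "norebalance"])

    # Per-OSD limits on concurrent backfill and recovery operations.
    commands.append(["ceph", "config", "set", "osd",
                     "osd_max_backfills", str(max_backfills)])
    commands.append(["ceph", "config", "set", "osd",
                     "osd_recovery_max_active", str(max_recovery_active)])
    return commands


if __name__ == "__main__":
    # Print the commands so an operator can review them before running any.
    for cmd in throttle_commands():
        print(" ".join(cmd))
```

Lower limits mean less background traffic competing with client IO, at the cost of the cluster taking longer to return to a healthy state.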
We are actively working on this, and will update with any changes.
Cheers!
An update on this:
The cluster is currently stable, but still not in a healthy state. To avoid disrupting users, we are going to hold off on further attempts to restore its health until after Wikimania (next Monday).
Let us know (by mail, IRC, Phabricator task, etc.) if you are seeing any issues.
Thanks for your patience and understanding!
In reply to David Caro's message of 08/11 13:32 (quoted in full above).