An update on this:
The cluster is currently stable, but still not in a healthy state. We are going to hold off
on further attempts to restore its health until after Wikimania (next Monday) to avoid
disrupting users.
Let us know (by mail, IRC, Phabricator task, etc.) if you see any issues.
Thanks for your patience and understanding!
On 08/11 13:32, David Caro wrote:
Hi everyone!
TL;DR:
The service is currently degraded for VMs and anything running on them (e.g. Toolforge,
Quarry, PAWS, ...). You might be able to use the services, or they might be too slow. We
are working on it and will send an update when it is fixed.
Long story:
We were adding a new node to the Ceph cluster. This time the node was in a different
subnet, but Ceph is supposed to work transparently across multiple subnets. For some
reason the new node was added to the cluster, but it is not replying to heartbeats sent
from the other nodes. That causes the cluster to keep rebalancing data, which in turn
creates continuous IO slowness for any clients (like VMs).
We are trying to minimize the impact by limiting the amount of data that gets reshuffled.
That slows down the intervention a bit, but should improve the client experience.
We are actively working on this and will send an update when anything changes.
Cheers!
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."