On Fri, Nov 18, 2011 at 3:41 PM, Ben Hartshorne
<bhartshorne(a)wikimedia.org> wrote:
Answering a few questions in one place.
On Fri, Nov 18, 2011 at 10:29 AM, Brion Vibber <brion(a)wikimedia.org>
wrote:
Hmm... what I'd expect is that if one ES save target database is in
read-only, the system should cycle through to the next available one that
is working -- the save should then succeed transparently.
Do we not have that sort of write failover logic, or are *all* ES clusters
getting locked somehow?
The last step of the maintenance was to switch the master for article
writes from ms3 to es3. In order to make sure no data is lost during
the transition, I marked the master read-only for the duration of the
switch. Given that there is only one ES target database to which
writes are sent (currently es3), there is nowhere to fail over to.
(All slaves run read-only all the time.)
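To make the failover point concrete, here is a rough Python sketch of the "cycle through to the next available target" idea. It is not MediaWiki's actual ExternalStore code; the function name, the exception, and the read-only lookup are all made up for illustration. With es3 as the only configured target and es3 marked read-only, the loop runs out of candidates and the save fails.

```python
class NoWritableTarget(Exception):
    """Raised when every configured write target is read-only."""


def pick_write_target(targets, is_read_only):
    """Return the first target that accepts writes, in configured order.

    `targets` is a list of cluster names and `is_read_only` is a callable
    taking a name; both are hypothetical stand-ins for real configuration.
    """
    for name in targets:
        if not is_read_only(name):
            return name
    raise NoWritableTarget("all targets read-only: " + ", ".join(targets))


# Situation during the switch: one write target, and it is read-only.
read_only = {"es3": True}
try:
    pick_write_target(["es3"], lambda name: read_only[name])
except NoWritableTarget as exc:
    print("save fails:", exc)
```

With a second writable target in the list, the same call would return that target instead of raising.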
*nod* logical enough. For the future I'd recommend planning a temporary
'holding zone' cluster that would be used only during the changeover -- it
would remain read-write while the main ones are being copied.
Then after switching writes to the new targets, the holding zone can go
read-only while it gets copied over to the new target, which should go
relatively fast.
This would be just another part of the ES system rather than a separate
cache, so should remain reasonably robust: if something goes awry with the
main copy to the new clusters, you can safely stop: the holding zone will
just sit with the old servers and can keep running like the other ES
clusters, unlike some sort of cache which might lose data.
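
The order of steps could be sketched roughly as below. This is a toy in-memory model, not real ES tooling: the cluster dicts, the "holding" name, and the blob lists are assumptions for illustration; only ms3 and es3 come from the maintenance described above. The point is that at every step at least one cluster accepts writes.

```python
def writable(clusters):
    """Names of clusters currently accepting writes, in insertion order."""
    return [name for name, c in clusters.items() if not c["read_only"]]


def changeover(clusters, old, new, holding):
    """Walk the changeover; return the writable clusters after each step."""
    seen = []
    # 1. Open the holding zone, then freeze the old master for copying.
    clusters[holding] = {"read_only": False, "blobs": []}
    clusters[old]["read_only"] = True
    seen.append(writable(clusters))
    # Saves that arrive during the long copy land in the holding zone.
    clusters[holding]["blobs"].append("saved-during-copy")
    # 2. Copy old -> new, then switch writes to the new master.
    clusters[new]["blobs"] = list(clusters[old]["blobs"])
    clusters[new]["read_only"] = False
    seen.append(writable(clusters))
    # 3. Freeze the holding zone and copy its small backlog across.
    clusters[holding]["read_only"] = True
    clusters[new]["blobs"].extend(clusters[holding]["blobs"])
    seen.append(writable(clusters))
    return seen


clusters = {
    "ms3": {"read_only": False, "blobs": ["rev1"]},
    "es3": {"read_only": True, "blobs": []},
}
for step in changeover(clusters, old="ms3", new="es3", holding="holding"):
    print(step)
```

If step 2 goes wrong, you can stop after step 1: the holding zone simply stays writable alongside the old servers.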
-- brion