On Fri, Nov 18, 2011 at 3:41 PM, Ben Hartshorne bhartshorne@wikimedia.org wrote:
Answering a few questions in one place.
On Fri, Nov 18, 2011 at 10:29 AM, Brion Vibber brion@wikimedia.org wrote:
Hmm... what I'd expect is that if one ES save target database is in read-only, the system should cycle through to the next available one that is working -- the save should then succeed transparently.
Do we not have that sort of write failover logic, or are *all* ES clusters getting locked somehow?
The last step of the maintenance was to switch the master for article writes from ms3 to es3. To make sure no data was lost during the transition, I marked the master read-only for the duration of the switch. Given that there is only one ES target database to which writes are sent (currently es3), there is nowhere to fail over to. (All slaves run read-only all the time.)
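To make the failover question concrete, here is a rough sketch of the "cycle through to the next available target" behaviour and why it cannot help when there is only one write target. This is illustrative Python, not the actual MediaWiki ExternalStore code; `Store`, `ReadOnlyError`, `save_with_failover`, and the "spare" cluster name are all hypothetical.

```python
class ReadOnlyError(Exception):
    """Raised when a store refuses writes (hypothetical error type)."""


class Store:
    """Toy stand-in for one ES write target."""

    def __init__(self, name, read_only=False):
        self.name = name
        self.read_only = read_only
        self.blobs = {}

    def save(self, key, text):
        if self.read_only:
            raise ReadOnlyError(self.name)
        self.blobs[key] = text
        return f"{self.name}/{key}"


def save_with_failover(targets, key, text):
    """Try each write target in order; return the address from the first success."""
    for store in targets:
        try:
            return store.save(key, text)
        except ReadOnlyError:
            continue  # this target is locked, try the next one
    # Every target refused the write, so the save fails.
    raise ReadOnlyError("no writable target")
```

With a target list of just a read-only es3, the loop runs out of candidates immediately and the save fails; with a second writable target in the list, the same call would succeed.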
*nod* logical enough. For the future I'd recommend planning a temporary 'holding zone' cluster that would be used only during the changeover -- it would remain read-write while the main ones are being copied.
Then after switching writes to the new targets, the holding zone can go read-only while it gets copied over to the new target, which should go relatively fast.
This would be just another part of the ES system rather than a separate cache, so it should remain reasonably robust: if something goes awry with the main copy to the new clusters, you can safely stop. The holding zone will just sit with the old servers and can keep running like the other ES clusters, unlike some sort of cache, which might lose data.
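Roughly, the sequence I have in mind looks like the sketch below. It is a toy model in Python, not real ES tooling: `Cluster`, `copy_all`, and `changeover` are made-up names, and `copy_all` stands in for whatever out-of-band bulk copy is actually used.

```python
class Cluster:
    """Toy stand-in for an ES cluster (hypothetical)."""

    def __init__(self, name, read_only=False):
        self.name = name
        self.read_only = read_only
        self.blobs = {}

    def save(self, key, text):
        if self.read_only:
            raise RuntimeError(f"{self.name} is read-only")
        self.blobs[key] = text


def copy_all(src, dst):
    """Bulk-copy every blob from src to dst, ignoring dst's read-only flag."""
    dst.blobs.update(src.blobs)


def changeover(old, new, holding, writes_during_copy):
    """Move writes from old to new without ever refusing a save."""
    # 1. Writes now go to the holding zone; freeze the old master.
    old.read_only = True
    # 2. Saves arriving during the maintenance window land in the holding zone.
    for key, text in writes_during_copy:
        holding.save(key, text)
    # 3. Copy the frozen old master to the new target.
    copy_all(old, new)
    # 4. Switch writes to the new target, then freeze and drain the holding zone.
    new.read_only = False
    holding.read_only = True
    copy_all(holding, new)
    return new
```

If step 3 goes wrong you can simply stop there: the holding zone still has every save made during the window and stays read-write.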
-- brion