Tomasz Finc wrote:
For future dumps, we might have to resort to some form of snapshot server that is fed all updates either from the memcaches or the MySQLs. That would allow a live backup to be performed, so it would be useful for more than just dumps.
Possibly, but the crux of it is simply the page text from external storage. Fetching the metadata, though lengthy, is very quick in the grand scheme of things.
Is it possible to suspend individual slaves temporarily during off-peak hours to flush the database to disk and then copy the database files to another computer? If not, we may still be able to use "stale" database files copied to another computer, as long as we only use data from them that is at least a few days old, so we know it's been flushed to disk (I'm not sure how MySQL flushes the data...).
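Roughly what I have in mind for the flush-and-copy step, as a sketch only (hostnames, credentials and paths are made up, and the lock/flush behaviour is my assumption about how MySQL handles it):

  import subprocess
  import pymysql

  conn = pymysql.connect(host="es-slave-1", user="backup", password="secret")
  cur = conn.cursor()
  try:
      cur.execute("STOP SLAVE")                   # stop applying replication events
      cur.execute("FLUSH TABLES WITH READ LOCK")  # flush tables and block writes
      # The read lock is held for as long as this connection stays open,
      # so the raw file copy below sees a quiesced datadir.
      subprocess.check_call(
          ["rsync", "-a", "/var/lib/mysql/", "backup-host:/srv/es-snapshot/"])
  finally:
      cur.execute("UNLOCK TABLES")
      cur.execute("START SLAVE")                  # let the slave catch up again
      conn.close()

Note that FLUSH TABLES WITH READ LOCK only guarantees a clean on-disk state for MyISAM tables; for InnoDB a filesystem snapshot or a full mysqld shutdown would be safer.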
Spinning down a slave won't help us much since external storage is the slowdown. But mirroring that content elsewhere might be the way to go. External storage by itself is just a set of MySQL DBs. I'm curious to see if there might be a better storage subsystem to optimize for this.
AFAIK External Storage is used with direct assignment: a wiki gets assigned an ES cluster and uses it for a long period of time. Thus, from one dump to the next, all text will with high probability be on the same ES cluster. Could the dump be run on one of the current ES cluster slaves? Moving the old dump might be a greater problem than getting the articles (though it's much 'easier' to move, since everything is in a single data block and pipelines nicely), but the dumper machine could become an ES slave.
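If the dumper box were an ES slave, the expensive text fetch could become a purely local read. A rough sketch of that idea (the text/blobs table layout and the DB://cluster/id pointer format are just my understanding of the usual MediaWiki external-storage setup, and all hostnames, database names and credentials are invented):

  import re
  import pymysql

  core = pymysql.connect(host="db-core", db="enwiki", user="dump", password="x")
  es = pymysql.connect(host="localhost", db="enwiki", user="dump", password="x")

  def fetch_text(old_id):
      # Look up the text-table row; for ES-backed revisions old_text is a
      # pointer like 'DB://cluster5/12345' rather than the text itself.
      with core.cursor() as cur:
          cur.execute("SELECT old_text, old_flags FROM text WHERE old_id = %s",
                      (old_id,))
          old_text, old_flags = cur.fetchone()
      if b"external" not in old_flags:
          return old_text                  # stored inline, nothing to resolve
      m = re.match(rb"DB://(\w+)/(\d+)", old_text)
      blob_id = int(m.group(2))
      # Assuming this machine replicates the cluster named in the pointer,
      # the blob read below never leaves the box.
      with es.cursor() as cur:
          cur.execute("SELECT blob_text FROM blobs WHERE blob_id = %s",
                      (blob_id,))
          return cur.fetchone()[0]

Real dump code would also have to deal with the other old_flags variants (gzip, object, etc.), but the point is only that the blob read stays on the local machine.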