Daniel Kinzler wrote:
A more conventional solution would be to have two more copies of the files, on
the same server, which are synced every, say, 24 hours: backup a -> backup b,
live mirror -> backup a. But this would require three times the space. Considering
we have 5TB worth of media files currently (does this include thumbnails?), and
the new server will have 24TB of space, this could work for a while. But taking
into account exponential growth, it wouldn't last long.
Tripling space requirements seems a bit of overkill. Maybe there's a smarter
way to do this.
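For reference, that rotation would amount to a daily cron job along these
lines (just a rough Python sketch; the paths and the choice of rsync are my
assumptions, not part of the proposal):

    import subprocess

    # Hypothetical paths for the three copies on the same server.
    LIVE = "/srv/media/live-mirror/"
    BACKUP_A = "/srv/media/backup-a/"
    BACKUP_B = "/srv/media/backup-b/"

    def rotate():
        # Sync the oldest copy first, so backup b lags the live mirror by
        # up to 48 hours and backup a by up to 24 hours.
        subprocess.run(["rsync", "-a", "--delete", BACKUP_A, BACKUP_B], check=True)
        subprocess.run(["rsync", "-a", "--delete", LIVE, BACKUP_A], check=True)

    if __name__ == "__main__":
        rotate()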
Seems worth mentioning how I am currently replicating commons files.
First, there's a bot watching file uploads in order to scan them, so all
files are usually already on the box.
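The watcher is more or less equivalent to polling the upload log through the
API. A simplified Python sketch (not the actual bot, and the parameters are
my own choice):

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def recent_uploads(limit=50):
        """Titles of the most recently uploaded files on Commons."""
        params = {
            "action": "query",
            "list": "logevents",
            "letype": "upload",
            "lelimit": limit,
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        return [event["title"] for event in data["query"]["logevents"]]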
Then I run a script to make quasi-snapshots of Commons. They aren't real
snapshots, since I go through the API and thus don't get an exact point in
time. The Toolserver doesn't have that problem: as it keeps a copy of the
Commons db, it can directly query a snapshot of the image table.
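The quasi-snapshot is basically a paged walk over the full file list,
something like the sketch below (the API variant; on the Toolserver the same
list comes straight from the image table instead):

    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def list_all_files():
        """Yield (name, sha1) for every file, following API continuation."""
        params = {
            "action": "query",
            "list": "allimages",
            "aiprop": "sha1",
            "ailimit": "500",
            "format": "json",
        }
        while True:
            data = requests.get(API, params=params).json()
            for image in data["query"]["allimages"]:
                yield image["name"], image["sha1"]
            if "continue" not in data:
                break
            params.update(data["continue"])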
For each image, the script looks for a copy in the previous snapshots as
well as in the uploads copy (verifying it by the hash). Only a few images
are not found and thus need to be downloaded; all the others are hardlinked.
As each run downloads into a different folder, I get snapshots of different
points in time. Deleted images are simply not hardlinked.
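The per-image step is roughly the following (a sketch; the real script is
more involved, and the names here are made up):

    import hashlib
    import os

    def sha1_of(path):
        """SHA-1 of a file, read in chunks so big media files don't fill RAM."""
        digest = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def place_image(name, expected_sha1, snapshot_dir, older_snapshots,
                    uploads_dir, download):
        """Hardlink a verified existing copy into the new snapshot, or fetch it."""
        target = os.path.join(snapshot_dir, name)
        for source_dir in older_snapshots + [uploads_dir]:
            candidate = os.path.join(source_dir, name)
            if os.path.isfile(candidate) and sha1_of(candidate) == expected_sha1:
                os.link(candidate, target)  # share the data, just add a new link
                return
        download(name, target)  # only the few images not found anywhere locally

An unchanged file therefore costs only a new directory entry in the latest
snapshot, not another copy of its data.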
The system uses XFS, but the script doesn't require any special filesystem
features beyond ordinary unix hardlinks, although a filesystem that doesn't
fix the number of inodes at creation time is strongly encouraged.
You may spend some GB per snapshot on inodes (1GB per 4M files, given an
inode size of 256 bytes) and some more on folder contents, but that's
completely acceptable, as the size of the new files you find per snapshot is
an order of magnitude bigger.
Some caveats: the oldimage table has 'unexpected' entries. Don't make
assumptions such as "a filename can't appear twice" or "there will always ..."
Of course, the code is available. If I can be of help... just ask :)