Daniel Kinzler wrote:
A more conventional solution would be to have two more copies of the files, on the same server, synced every, say, 24 hours: backup A -> backup B, then live mirror -> backup A. But this would require three times the space. Considering we currently have 5TB worth of media files (does this include thumbnails?), and the new server will have 24TB of space, this could work for a while. But taking exponential growth into account, it wouldn't last long.
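For illustration only, here is a minimal sketch of that rotation; the paths and the use of rsync are assumptions on my part, not part of the proposal:

#!/usr/bin/env python
"""Sketch of the rotation described above (hypothetical paths).

Rotates backup A into backup B, then refreshes backup A from the live
mirror, so B always lags A by one sync cycle (e.g. 24 hours).
"""
import subprocess

LIVE_MIRROR = "/srv/media/live"    # assumed path to the live mirror
BACKUP_A = "/srv/media/backup-a"   # assumed path, one cycle behind live
BACKUP_B = "/srv/media/backup-b"   # assumed path, two cycles behind live


def sync(src, dst):
    # rsync -a preserves metadata; --delete keeps dst an exact copy of src
    subprocess.check_call(["rsync", "-a", "--delete", src + "/", dst + "/"])


if __name__ == "__main__":
    sync(BACKUP_A, BACKUP_B)      # backup A -> backup B
    sync(LIVE_MIRROR, BACKUP_A)   # live mirror -> backup A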
Tripling the space requirements seems like overkill, though. Maybe there's a smarter solution. Ideas?
--daniel
It seems worth mentioning how I am currently replicating the Commons files.
First, there's a bot watching file uploads in order to scan them, so all files are usually already on the box. Then I run a script to make quasi-snapshots of Commons. They aren't real snapshots: since I go through the API, they don't represent an exact point in time. The Toolserver doesn't have that problem; because it keeps a copy of the Commons database, it can directly query a snapshot of the image table.
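Roughly, the listing part of such a quasi-snapshot could look like the sketch below. This is not my actual script; the endpoint and parameters are the public MediaWiki API's list=allimages query, and it is Python 3:

"""Sketch: list current Commons files with their SHA-1 hashes through the
public API, one batch at a time. Because the batches arrive over several
requests, the result is only a quasi-snapshot, unlike a single query
against the Toolserver's image table."""
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"


def list_images():
    params = {
        "action": "query",
        "list": "allimages",
        "aiprop": "sha1|timestamp|url",
        "ailimit": "500",
        "format": "json",
    }
    while True:
        url = API + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for img in data["query"]["allimages"]:
            yield img["name"], img["sha1"], img["url"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume where the last batch ended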
For each image, the script looks for a copy in the previous snapshots as well as in the uploads copy (verifying by hash). Only a few images are not found and thus need to be downloaded. All others are hardlinked.
As each run downloads into a different folder, I get snapshots of different points in time. Deleted images are simply not hardlinked. The system runs XFS, but the script doesn't require any special filesystem feature beyond ordinary Unix hardlinks, although a filesystem that doesn't reserve a fixed number of inodes at creation time is really encouraged.
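A much simplified sketch of that hardlinking step (again, not the real script; the function and paths are illustrative only):

"""For each image in the new quasi-snapshot, look for a copy with the same
SHA-1 in the previous snapshot directories (or the uploads copy) and
hardlink it; only files found nowhere need to be downloaded."""
import hashlib
import os
import urllib.request


def sha1_of(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def place_file(name, sha1, url, snapshot_dir, previous_dirs):
    dest = os.path.join(snapshot_dir, name)
    # Look for an identical copy in earlier snapshots or the uploads copy.
    for old_dir in previous_dirs:
        candidate = os.path.join(old_dir, name)
        if os.path.isfile(candidate) and sha1_of(candidate) == sha1:
            os.link(candidate, dest)   # hardlink: no extra data blocks used
            return "linked"
    # Not found anywhere: one of the few files that actually gets downloaded.
    urllib.request.urlretrieve(url, dest)
    return "downloaded"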
You may spend a few GB per snapshot on inodes (about 1GB per 4 million files given an inode size of 256 bytes, since 4,000,000 x 256 bytes is roughly 1GB) and some more on directory contents, but that's completely acceptable, as the size of the new files found per snapshot is an order of magnitude larger.
Some caveats: the oldimage table has 'unexpected' entries. Don't make assumptions such as "a filename can't appear twice" or "there will always be a file".
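In practice that caveat means code walking oldimage should group rows by name rather than assume uniqueness, and should tolerate rows with no file behind them. A hypothetical illustration (column names are the standard oi_name / oi_archive_name fields of the MediaWiki oldimage table):

"""Group old file versions by name, skipping rows whose file is missing."""
import os
from collections import defaultdict


def group_old_versions(rows, archive_dir):
    # rows: iterable of (oi_name, oi_archive_name) pairs from oldimage
    versions = defaultdict(list)   # a name can appear more than once
    for name, archive_name in rows:
        path = os.path.join(archive_dir, archive_name)
        if not os.path.isfile(path):
            continue               # there isn't always a file behind the row
        versions[name].append(path)
    return versions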
Of course, the code is available. If I can be of help... just ask :)
Yours, Platonides