On Fri, Sep 5, 2008 at 8:54 AM, Tim Starling <tstarling@wikimedia.org> wrote:
> If it helps, this file has the hashes already: http://noc.wikimedia.org/~tstarling/pass-3-targets-hashes
Thanks. Saved me a step… and fortunately I already had base conversion code handy.
Sadly, it takes a long time to SHA1 many terabytes of data. I started the process this morning, but I had made an error in assuming that xargs' parallel option (-P) wouldn't produce badly interleaved output, since it hadn't in a limited test. Turns out it did, so I had to start the hashing over again.
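Something along these lines ought to avoid the interleaving on the rerun, by giving each batch its own output file instead of a shared pipe (untested at this scale; the paths and the -n/-P numbers are just placeholders):

    # each sha1sum batch writes to its own temp file, so parallel children
    # never share a stream; concatenate once they're all done
    find /data/images -type f -print0 |
      xargs -0 -n 256 -P 8 sh -c 'sha1sum -- "$@" > "$(mktemp sha1.XXXXXX)"' sh
    cat sha1.* > all-hashes.txt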
(Might I suggest, beyond not invoking unlink(), that if your filesystem can handle some additional inode pressure you make daily or weekly hardlink snapshots in a directory tree inaccessible to the web front end? It's not as good as a real backup system, but it's cheap and easy. On my system (xfs) I have a dozen or so hardlink snapshots of the Wikimedia image collection: while I was getting updates I created snapshots that roughly coincided with the released database dumps.)
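In case it's useful, the sort of thing I mean is just the following (directory names made up; the snapshot parent should live somewhere the web front end can't see):

    # hardlink "copy": a new directory tree pointing at the same inodes,
    # so it costs directory entries and inodes but no file data
    cp -al /data/images "/backup/snapshots/images-$(date +%Y%m%d)"

rsync with --link-dest against the previous snapshot does much the same job if the source tree is still changing between snapshots.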
Since the hashing is going to take a while, I'll hop on IRC and pass you a link to a tar of the file-name matches. Turns out that I have *most* of them based on name match alone. (Dunno why my earlier count was wrong… perhaps a Unicode handling bug on my part; I'd just woken up when I sent my prior email.)