Andy Rabagliati wrote:
> It will take me a week or so to get a good look at these - but - a question for the developers - am I right to only accept files matching ./en/[0-9a-f]/../* from the archive ?
> Presumably uploads are just hashed into these dirs ?
Yes, that's correct. The directory name is derived from the MD5 hash of the filename.
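For what it's worth, the scheme Tim describes can be sketched like this. It's a guess at the exact layout, inferred from the ./en/[0-9a-f]/../* pattern above: first one hex digit of the MD5, then the first two. The space-to-underscore normalisation is also an assumption.

```python
import hashlib

def hashed_path(filename, root="./en"):
    """Return the hashed upload path for a filename, assuming a
    MediaWiki-style layout: <root>/<h>/<hh>/<filename>, where h and
    hh are the first one and two hex digits of md5(filename)."""
    # Assumption: the filename is hashed with spaces as underscores.
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "%s/%s/%s/%s" % (root, digest[0], digest[:2], name)

print(hashed_path("Example.jpg"))
```

If that guess is right, the ./en/[0-9a-f]/../* pattern in the question matches exactly the paths this function produces.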
> There are a few pics that come with the mediawiki software that I would, naturally, leave alone.
> In the first (Jun) archive /thumb/* was about 700 MB, and /archive/* was similar. There were also a lot of encyclopedia pics in the root dir - I threw them all away without noticing anything untoward.
In the real root directory there are symlinks to images in the other directories, apparently left there to avoid breaking URLs from an earlier version of the software. Tar will have converted them from symlinks into duplicate files, so you can delete them.
> I might run a script over the archive and convert large images to ones with the same dimensions but, say, 70% JPEG quality. I imagine I could easily halve the archive size that way.
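A dry run along those lines might look like the following. It assumes ImageMagick's convert, a hypothetical 100 KB threshold, and JPEG-only uploads; it only prints the commands, so nothing is overwritten until the leading echo is dropped.

```shell
#!/bin/sh
# Print, but do not run, a recompress command for every JPEG over 100 KB
# in the hashed upload tree. Remove the "echo" once the list looks sane.
find ./en -type f \( -name '*.jpg' -o -name '*.jpeg' \) -size +100k 2>/dev/null |
while read -r f; do
    echo convert -quality 70 "$f" "$f"
done
```

Running it against a copy of the tree first would be prudent, since convert with the same input and output path rewrites files in place.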
Quite likely.
> If there are other regexes that would catch files resized by the server, I would be very grateful for the hint.
The thumb directory contains all the images resized automatically, although the ./en/[0-9a-f] directories will contain some duplicate images resized by hand.
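Under those rules, a path-based classifier might look like this sketch. The backreference check (the two-character directory begins with the one-character directory's digit) is an assumption based on the usual MD5 layout; everything under ./thumb/ is treated as server-resized per the description above.

```python
import re

# Originals live in the hashed upload tree; everything under ./thumb/
# is a server-generated resize. The hashed-tree pattern assumes the
# second directory is the first two hex digits of the MD5, so it must
# start with the same digit as the first directory.
ORIGINAL = re.compile(r"^\./en/([0-9a-f])/\1[0-9a-f]/[^/]+$")
SERVER_RESIZED = re.compile(r"^\./thumb/")

def classify(path):
    if SERVER_RESIZED.match(path):
        return "server-resized"
    if ORIGINAL.match(path):
        return "original"
    return "other"
```

As noted, hand-resized duplicates sit in the hashed tree alongside real originals, so no path regex can pick those out; that would take comparing image dimensions or checksums.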
-- Tim Starling