On Fri, 15 Oct 2004, Andre Engels wrote:
On Wed, 13 Oct 2004 23:15:59 +0200, andyr@wizzy.com wrote:
On Wed, 13 Oct 2004, Tim Starling wrote:
I've added the image tarball generation to the backup script, so new tarballs will be generated every week from now on.
14Gig - Owww.
In a few short months it has grown from 3Gig.
Sorry - the cron jobs had not run and I was looking at the full db archive.
Now I see the pictures - 8Gig - not quite so bad :-)
I assume the cause of that is the new image syntax: it used to be that if you had a large image, you'd make it smaller (which also decreased its file size). Now the large version is put on the site and shown smaller to the reader with the '000px' markup, which means there are many more large (sometimes huge) image files.
It will take me a week or so to get a good look at these - but a question for the developers: am I right to accept only files matching ./en/[0-9a-f]/../* from the archive?
Presumably uploads are just hashed into these dirs?
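Something like the Python sketch below is what I have in mind for the filter - untested, and it assumes the second path level really is the two-hex-digit subdir (MediaWiki-style hashed upload dirs), with 'upload.tar' just a placeholder name:

    import re
    import tarfile

    # Assumption: hashed upload paths look like en/a/ab/Some_file.jpg,
    # where the two components are hex digits of the filename hash.
    UPLOAD_RE = re.compile(r'^(\./)?en/[0-9a-f]/[0-9a-f]{2}/[^/]+$')

    def wanted(member):
        # Keep only regular files that sit in the hashed upload dirs;
        # this drops /thumb/*, /archive/* and anything in the root dir.
        return member.isfile() and bool(UPLOAD_RE.match(member.name))

    with tarfile.open('upload.tar') as tar:   # placeholder filename
        keep = [m for m in tar.getmembers() if wanted(m)]
        tar.extractall(members=keep)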
There are a few pics that come with the MediaWiki software that I would, naturally, leave alone.
In the first (Jun) archive /thumb/* was about 700Meg, and /archive/* was similar. There were also a lot of encyclopedia pics in the root dir - I threw them all away without noticing anything untoward.
I might run a script over the archive and convert large images to ones of the same dimensions but, say, 70% JPEG quality. I imagine I could easily halve the archive size that way.
If there are other regexes that would catch files resized by the server, I would be very grateful for the hint.
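For the recompression, roughly what I have in mind is the sketch below (using Pillow) - the 70 and the 'en' root are just my guesses, it only touches JPEGs, and a real run should keep the original whenever the re-encoded file somehow comes out bigger:

    import os
    from PIL import Image

    def recompress(path, quality=70):
        # Re-save a JPEG in place at reduced quality; leave everything
        # else alone.
        img = Image.open(path)
        if img.format != 'JPEG':
            return
        img.load()                    # pull pixel data before overwriting
        img.save(path, 'JPEG', quality=quality)

    for root, dirs, files in os.walk('en'):   # 'en' is a placeholder root
        for name in files:
            if name.lower().endswith(('.jpg', '.jpeg')):
                recompress(os.path.join(root, name))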
Currently I am getting the archive onto a US server, unpacking it, throwing away what I don't need, and then rsyncing it down to a friendly server in South Africa.
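The weekly cycle on the US box is basically the three steps below - hostnames, URLs and the script name are all placeholders, and 'filter_and_recompress.py' stands for the two sketches above:

    import subprocess

    STEPS = [
        # 1. fetch the weekly tarball (URL is a placeholder)
        ['wget', '-c', 'http://example.org/upload.tar'],
        # 2. unpack only the hashed upload dirs and recompress the JPEGs
        ['python', 'filter_and_recompress.py', 'upload.tar'],
        # 3. push the pruned tree down to the friendly ZA mirror (placeholder host)
        ['rsync', '-av', '--delete', 'en/', 'za-mirror:/data/wikipedia/en/'],
    ]

    for cmd in STEPS:
        subprocess.check_call(cmd)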
Cheers, Andy!