On Thu, 10 Jun 2004, Tim Starling wrote:
Hi,
This is just a one-off thing at the moment; I haven't set up scripts to do it on a regular basis. So if I forget, please remind me. I've made a tarball of all the images from the English Wikipedia. See the bottom of:
http://download.wikimedia.org/
I had a quick discussion on IRC about the wording of the legal statement; it seems we can't really be sure that it's legal to do anything at all with them. It's an unsatisfactory situation, in my opinion, but there you have it. So the disclaimer says "use at your own risk".
Folks,
Just to keep you up to date, we have had great success with wikipedia installations in schools in South Africa.
I use these images for the install. I have downloaded more recent database snapshots to go with them, but since the schools cannot contribute changes back (they are quite isolated, bandwidth-wise), for the most part I just install a June database snapshot (the same date as the pictures).
I am excited by the potential of carrying snapshots and image diffs, probably selected by tar using the file dates. We (Wizzy Digital Courier) can carry these by UUCP - using dialup connections or physical carrying on a USB memory stick.
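Roughly the sort of thing I have in mind for the image diffs (an untested sketch - the cutoff date and paths are only examples):
# build a "diff" tarball of images added or changed since the last courier run
# (untested sketch -- the cutoff date and paths are examples only)
cd /var/wikipedia/images
tar -cf /var/spool/courier/upload-diff-20040601.tar --newer-mtime='2004-06-01' en/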
I split them up onto 6 CDs, but found that one whole CD was the thumb/ directory, so you can skip that :-)
I would also love to preserve the 'newsy' feel of the front page. I had a suggestion from Sj that I grab the front page conventionally and patch the URLs to point to a local wikipedia installation.
If you could put up a tarball of the whole archive again I can rsync it this way. Or - put up an rsync server?
That would also handle deletion of orphaned images.
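Something like this is what I am picturing, if an rsync module were offered (the module name here is invented - no such server exists yet):
# mirror the English image tree, deleting anything that has gone away upstream
# (the 'images' module name is hypothetical)
rsync -av --delete rsync://download.wikimedia.org/images/en/ /var/wikipedia/images/en/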
Cheers, Andy!
http://wizzy.org.za/ (not been updated in a while)
http://www.slug.org.za/ (Shuttleworth Foundation project putting Open Source into schools)
Andy Rabagliati wrote:
Folks,
Just to keep you up to date, we have had great success with wikipedia installations in schools in South Africa.
I use these images for the install. I have downloaded more recent database snapshots to go with them, but since the schools cannot contribute changes back (they are quite isolated, bandwidth-wise), for the most part I just install a June database snapshot (the same date as the pictures).
I've added the image tarball generation to the backup script, so new tarballs will be generated every week from now on. I'll see what I can do regarding rsync or diffs.
I haven't yet added the tarballs to the download.wikimedia.org web page. They can be found by browsing the following directories:
http://download.wikimedia.org/archives/
http://download.wikimedia.org/archives_wiktionary/
http://download.wikimedia.org/archives_wikiquote/
http://download.wikimedia.org/archives_wikibooks/
http://download.wikimedia.org/archives_special/
The upload.tar files are symlinks to the most recent date-specific upload tarball, e.g. 20041013_upload.tar
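The naming convention works roughly like this (a simplified sketch, not the actual backup script):
# simplified sketch of the naming convention, not the actual backup script
DATE=$(date +%Y%m%d)                 # e.g. 20041013
tar -cf ${DATE}_upload.tar en/
ln -sf ${DATE}_upload.tar upload.tar # upload.tar always points at the newest tarball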
The legal information regarding tarballs on download.wikimedia.org still applies, quoted below:
"Unlike the article text, many images are not released under GFDL or the public domain. These images are owned by external parties who may not have consented to their use in Wikipedia. Wikipedia uses such images under the doctrine of fair use under United States law. Use of such images outside the context of Wikipedia or similar works may be illegal. Also, many images legally require a credit or other attached copyright information, and this copyright information is contained within the text dumps above. Some images may be restricted to non-commercial use, or may even be licensed exclusively to Wikipedia. Hence, download these images at your own risk."
-- Tim Starling
Andy Rabagliati wrote:
Just to keep you up to date, we have had great success with wikipedia installations in schools in South Africa.
On Wed, 13 Oct 2004, Tim Starling wrote:
I've added the image tarball generation to the backup script, so new tarballs will be generated every week from now on.
14Gig - Owww.
In a few short months it has grown from 3Gig.
I (knapsack)-package it as
wikipedia-images-11-2004.06.01-wiz.noarch.rpm
wikipedia-images-96-2004.06.01-wiz.noarch.rpm
wikipedia-images-6b-2004.06.01-wiz.noarch.rpm
wikipedia-images-f9-2004.06.01-wiz.noarch.rpm
wikipedia-images-70-2004.06.01-wiz.noarch.rpm
wikipedia-images-fa-2004.06.01-wiz.noarch.rpm
wikipedia-images-83-2004.06.01-wiz.noarch.rpm
...
on 4 CD disks with
wikipedia-tables-2004.03.27-wiz.noarch.rpm (320M) (en edition)
I ignore anything in the tarballs that doesn't match [0-9a-f]/../
Cuts out a lot of dross.
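The filtering can be done at unpack time with GNU tar wildcards, something like this (a sketch of the idea, not exactly the command I run):
# extract only the hashed image directories, skipping thumb/, archive/
# and the loose files in the root (sketch; member paths may need a ./ prefix)
tar -xf 20040601_upload.tar --wildcards 'en/[0-9a-f]/*'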
We are moving into DVD space.
I had all three LOTR books on my PalmIIIc with 8Meg RAM earlier this year.
I think there is a market for an image-compressed archive, more compatible with my palm pilot. I realise the problem is about licensing.
[I bought a new laptop the other day - IR and serial are 'legacy' now ...]
How are we going to carry the bandwidth around?
Cheers, Andy!
I ignore anything in the tarballs that doesn't match [0-9a-f]/../
Cuts out a lot of dross.
We are moving into DVD space.
How are we going to carry the bandwidth around?
Another reason to have rough estimates of content quality, and of content depth too -- so people who need 100M of the most important content can get it. We could also offer thumbnail-only images for a reduced image tarball...
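A thumbnail-only tarball could be as simple as something like this (a sketch; it assumes the thumb/ directory layout Andy describes):
# pack only the auto-generated thumbnails
# (sketch; assumes thumb/ sits at the top of the image tree as described above)
tar -cf upload_thumbs_only.tar thumb/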
On Wed, 13 Oct 2004 23:15:59 +0200, andyr@wizzy.com wrote:
On Wed, 13 Oct 2004, Tim Starling wrote:
I've added the image tarball generation to the backup script, so new tarballs will be generated every week from now on.
14Gig - Owww.
In a few short months it has grown from 3Gig.
I assume the cause of that is the new image syntax: it used to be that if you had a large image, you would scale it down yourself (which also reduced its file size). Now the full-size version is uploaded to the site and scaled down for the reader with the '000px' markup, which means there are many more large (sometimes huge) image files.
Andre Engels
On Fri, 15 Oct 2004, Andre Engels wrote:
On Wed, 13 Oct 2004 23:15:59 +0200, andyr@wizzy.com wrote:
On Wed, 13 Oct 2004, Tim Starling wrote:
I've added the image tarball generation to the backup script, so new tarballs will be generated every week from now on.
14Gig - Owww.
In a few short months it has grown from 3Gig.
Sorry - the cron jobs had not run and I was looking at the full db archive.
Now I see the pictures - 8Gig - not quite so bad :-)
I assume the cause of that is the new image syntax: it used to be that if you had a large image, you would scale it down yourself (which also reduced its file size). Now the full-size version is uploaded to the site and scaled down for the reader with the '000px' markup, which means there are many more large (sometimes huge) image files.
It will take me a week or so to get a good look at these - but - a question for the developers - am I right to only accept files matching ./en/[0-9a-f]/../* from the archive?
Presumably uploads are just hashed into these dirs?
There are a few pics that come with the mediawiki software that I would, naturally, leave alone.
In the first (Jun) archive /thumb/* was about 700Meg, and /archive/* was similar. There were also a lot of encyclopedia pics in the root dir - I threw them all away without noticing anything untoward.
I might run a script over the archive and convert large images to ones of the same size but, say, 70% quality. I imagine I could easily halve the archive size that way.
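Something along these lines is what I have in mind (an untested sketch using ImageMagick; it only touches JPEGs):
# recompress every JPEG in place at quality 70
# (untested sketch; needs ImageMagick's mogrify, and leaves other formats alone)
find en -type f -iname '*.jpg' -exec mogrify -quality 70 {} +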
If there are other regexes that would catch files resized by the server I would be very grateful for the hint.
Currently I am getting the archive to a US server, unpacking, throwing away, and then rsyncing down to a friendly server in South Africa.
Cheers, Andy!
Andy Rabagliati wrote:
It will take me a week or so to get a good look at these - but - a question for the developers - am I right to only accept files matching ./en/[0-9a-f]/../* from the archive?
Presumably uploads are just hashed into these dirs?
Yes, that's correct. The directory name is derived from the MD5 hash of the filename.
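So the path for a given upload can be reproduced along these lines (a sketch; it assumes the usual first-one-and-two-hex-digit scheme, with spaces stored as underscores):
# reproduce the hashed path for an upload (sketch only; the example
# filename is made up, and spaces are assumed to be stored as underscores)
name="Example_image.jpg"
hash=$(printf '%s' "$name" | md5sum | cut -c1-32)
echo "en/${hash:0:1}/${hash:0:2}/${name}"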
There are a few pics that come with the mediawiki software that I would, naturally, leave alone.
In the first (Jun) archive /thumb/* was about 700Meg, and /archive/* was similar. There were also a lot of encyclopedia pics in the root dir - I threw them all away without noticing anything untoward.
In the real root directory there are symlinks to images in the other directories, apparently left there to avoid breaking URLs used in an earlier version of the software. Obviously tar has converted them from symlinks to duplicates. You can delete them.
I might run a script over the archive and convert large images to ones of the same size but, say, 70% quality. I imagine I could easily halve the archive size that way.
Quite likely.
If there are other regexes that would catch files resized by the server I would be very grateful for the hint.
The thumb directory contains all the images resized automatically, although the ./en/[0-9a-f] directories will contain some duplicate images resized by hand.
-- Tim Starling
On Sat, 16 Oct 2004 22:55:00 +1000, Tim Starling ts4294967296@hotmail.com wrote:
In the real root directory there are symlinks to images in the other directories, apparently left there to avoid breaking URLs used in an earlier version of the software. Obviously tar has converted them from symlinks to duplicates. You can delete them.
Perhaps tar could be told to --exclude these, so they don't take up everyone's space and bandwidth - from my glance at 'info tar', something like this might do it:
tar --no-wildcards-match-slash --exclude "*.*"
(I know it looks like "exclude everything", but I'm assuming none of the directories will have a . in the middle of their names, and that first argument should be pretty self-explanatory...)
Of course, for all I know, there's stuff in that root that *is* useful in the tar-ball, in which case the appropriate args may be a lot more complex.
On Sun, 17 Oct 2004 00:52:47 +0100, Rowan Collins rowan.collins@gmail.com wrote:
On Sat, 16 Oct 2004 22:55:00 +1000, Tim Starling ts4294967296@hotmail.com wrote:
In the real root directory there are symlinks to images in the other directories, apparently left there to avoid breaking URLs used in an earlier version of the software. Obviously tar has converted them from symlinks to duplicates. You can delete them.
Perhaps tar could be told to --exclude these, so they don't take up everyone's space and bandwidth - from my glance at 'info tar', something like this might do it:
tar --no-wildcards-match-slash --exclude "*.*"
Ahem! Better make that:
tar -cf $FILE_OUT $IMG_PATH --no-wildcards-match-slash --exclude "$IMG_PATH/*.*"
[probably ;)]
On Sun, 17 Oct 2004, Rowan Collins wrote:
Ahem! Better make that:
tar -cf $FILE_OUT $IMG_PATH --no-wildcards-match-slash --exclude "$IMG_PATH/*.*"
Actually, a simple
tar cf $FILE_OUT en/[0-9a-f]/
would work for me.
The tarball comes down from 7.5Gig to 5.3Gig with that.
Thanks again to Tim Starling for making this available.
There is some duplication of pictures in the archive, and I am not sure if orphaned pictures are also in this tarball - I think that they are.
Any chance of providing a list of orphaned pics so I can scrub them out too?
Cheers, Andy!