On Thu, 10 Jun 2004, Tim Starling wrote:
Hi,
This is just a one-off thing at the moment; I haven't set up scripts to do it on a regular basis. So if I forget, please remind me. I've made a tarball of all the images from the English Wikipedia. See the bottom of:
http://download.wikimedia.org/
I had a quick discussion on IRC about the wording of the legal statement; it seems we can't really be sure that it's legal to do anything at all with them. It's an unsatisfactory situation, in my opinion, but there you have it. So the disclaimer says "use at your own risk".
Folks,
Just to keep you up to date, we have had great success with wikipedia installations in schools in South Africa.
I use these images for the install. I have downloaded more recent database snapshots to go with them, but since the schools cannot contribute changes back (they are quite isolated, bandwidth-wise), for the most part I just install a June database snapshot (the same date as the pictures).
I am excited by the potential of carrying snapshots and image diffs, probably selected by tar using the file dates. We (Wizzy Digital Courier) can carry these by UUCP - using dialup connections or physical carrying on a USB memory stick.
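Roughly the sort of thing I have in mind for the image diffs (an untested sketch - the cutoff date and paths are only examples):
# build a "diff" tarball of images added or changed since the last courier run
# (untested sketch -- the cutoff date and paths are examples only)
cd /var/wikipedia/images
tar -cf /var/spool/courier/upload-diff-20040601.tar --newer-mtime='2004-06-01' en/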
I split them up onto 6 CDs, but found that one whole CD was the thumb/ directory, so you can skip that :-)
I would also love to preserve the 'newsy' feel of the front page. I had a suggestion from Sj that I grab the front page conventionally and patch the URLs to point to a local wikipedia installation.
If you could put up a tarball of the whole archive again I can rsync it this way. Or - put up an rsync server?
That would also handle deletion of orphaned images.
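Something like this is what I am picturing, if an rsync module were offered (the module name here is invented - no such server exists yet):
# mirror the English image tree, deleting anything that has gone away upstream
# (the 'images' module name is hypothetical)
rsync -av --delete rsync://download.wikimedia.org/images/en/ /var/wikipedia/images/en/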
Cheers, Andy!
http://wizzy.org.za/ (not been updated in a while)
http://www.slug.org.za/ (Shuttleworth Foundation project putting Open Source into schools)
Andy Rabagliati wrote:
Folks,
Just to keep you up to date, we have had great success with wikipedia installations in schools in South Africa.
I use these images for the install. I have downloaded more recent database snapshots to go with them, but since the schools cannot contribute changes back (they are quite isolated, bandwidth-wise), for the most part I just install a June database snapshot (the same date as the pictures).
I've added the image tarball generation to the backup script, so new tarballs will be generated every week from now on. I'll see what I can do regarding rsync or diffs.
I haven't yet added the tarballs to the download.wikimedia.org web page. They can be found by browsing the following directories:
http://download.wikimedia.org/archives/
http://download.wikimedia.org/archives_wiktionary/
http://download.wikimedia.org/archives_wikiquote/
http://download.wikimedia.org/archives_wikibooks/
http://download.wikimedia.org/archives_special/
The upload.tar files are symlinks to the most recent date-specific upload tarball, e.g. 20041013_upload.tar
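The naming convention works roughly like this (a simplified sketch, not the actual backup script):
# simplified sketch of the naming convention, not the actual backup script
DATE=$(date +%Y%m%d)                 # e.g. 20041013
tar -cf ${DATE}_upload.tar en/
ln -sf ${DATE}_upload.tar upload.tar # upload.tar always points at the newest tarball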
The legal information regarding tarballs on download.wikimedia.org still applies, quoted below:
"Unlike the article text, many images are not released under GFDL or the public domain. These images are owned by external parties who may not have consented to their use in Wikipedia. Wikipedia uses such images under the doctrine of fair use under United States law. Use of such images outside the context of Wikipedia or similar works may be illegal. Also, many images legally require a credit or other attached copyright information, and this copyright information is contained within the text dumps above. Some images may be restricted to non-commercial use, or may even be licensed exclusively to Wikipedia. Hence, download these images at your own risk."
-- Tim Starling
Andy Rabagliati wrote:
Just to keep you up to date, we have had great success with wikipedia installations in schools in South Africa.
On Wed, 13 Oct 2004, Tim Starling wrote:
I've added the image tarball generation to the backup script, so new tarballs will be generated every week from now on.
14Gig - Owww.
In a few short months it has grown from 3Gig.
I (knapsack)-package it as
wikipedia-images-11-2004.06.01-wiz.noarch.rpm
wikipedia-images-96-2004.06.01-wiz.noarch.rpm
wikipedia-images-6b-2004.06.01-wiz.noarch.rpm
wikipedia-images-f9-2004.06.01-wiz.noarch.rpm
wikipedia-images-70-2004.06.01-wiz.noarch.rpm
wikipedia-images-fa-2004.06.01-wiz.noarch.rpm
wikipedia-images-83-2004.06.01-wiz.noarch.rpm
...
on 4 CD disks with
wikipedia-tables-2004.03.27-wiz.noarch.rpm (320M) (en edition)
I ignore anything in the tarballs that doesn't match [0-9a-f]/../
Cuts out a lot of dross.
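The filtering can be done at unpack time with GNU tar wildcards, something like this (a sketch of the idea, not exactly the command I run):
# extract only the hashed image directories, skipping thumb/, archive/
# and the loose files in the root (sketch; member paths may need a ./ prefix)
tar -xf 20040601_upload.tar --wildcards 'en/[0-9a-f]/*'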
We are moving into DVD space.
I had all three LOTR books on my PalmIIIc with 8Meg RAM earlier this year.
I think there is a market for an image-compressed archive, more compatible with my palm pilot. I realise the problem is about licensing.
[I bought a new laptop the other day - IR and serial are 'legacy' now ...]
How are we going to carry the bandwidth around?
Cheers, Andy!
I ignore anything in the tarballs that doesn't match [0-9a-f]/../
Cuts out a lot of dross.
We are moving into DVD space.
How are we going to carry the bandwidth around?
Another reason to have rough estimates of content quality, and of content depth too -- so people who need 100M of the most important content can get it. We could also offer thumbnail-only images for a reduced image tarball...
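A thumbnail-only tarball could be as simple as something like this (a sketch; it assumes the thumb/ directory layout Andy describes):
# pack only the auto-generated thumbnails
# (sketch; assumes thumb/ sits at the top of the image tree as described above)
tar -cf upload_thumbs_only.tar thumb/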
On Wed, 13 Oct 2004 23:15:59 +0200, andyr@wizzy.com wrote:
On Wed, 13 Oct 2004, Tim Starling wrote:
I've added the image tarball generation to the backup script, so new tarballs will be generated every week from now on.
14Gig - Owww.
In a few short months it has grown from 3Gig.
I assume the cause of that is the new image syntax: it used to be that if you had a large image, you would scale it down yourself (which also reduced its file size). Now the full-size version is uploaded to the site and scaled down for the reader with the '000px' markup, which means there are many more large (sometimes huge) image files.
Andre Engels
On Fri, 15 Oct 2004, Andre Engels wrote:
On Wed, 13 Oct 2004 23:15:59 +0200, andyr@wizzy.com wrote:
On Wed, 13 Oct 2004, Tim Starling wrote:
I've added the image tarball generation to the backup script, so new tarballs will be generated every week from now on.
14Gig - Owww.
In a few short months it has grown from 3Gig.
Sorry - the cron jobs had not run and I was looking at the full db archive.
Now I see the pictures - 8Gig - not quite so bad :-)
I assume the cause of that is the new image syntax: it used to be that if you had a large image, you would scale it down yourself (which also reduced its file size). Now the full-size version is uploaded to the site and scaled down for the reader with the '000px' markup, which means there are many more large (sometimes huge) image files.
It will take me a week or so to get a good look at these - but - a question for the developers - am I right to only accept files matching ./en/[0-9a-f]/../* from the archive?
Presumably uploads are just hashed into these dirs?
There are a few pics that come with the mediawiki software that I would, naturally, leave alone.
In the first (Jun) archive /thumb/* was about 700Meg, and /archive/* was similar. There were also a lot of encyclopedia pics in the root dir - I threw them all away without noticing anything untoward.
I might run a script over the archive and convert large images to ones of the same size but, say, 70% quality. I imagine I could easily halve the archive size that way.
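Something along these lines is what I have in mind (an untested sketch using ImageMagick; it only touches JPEGs):
# recompress every JPEG in place at quality 70
# (untested sketch; needs ImageMagick's mogrify, and leaves other formats alone)
find en -type f -iname '*.jpg' -exec mogrify -quality 70 {} +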
If there are other regexes that would catch files resized by the server I would be very grateful for the hint.
Currently I am getting the archive to a US server, unpacking, throwing away, and then rsyncing down to a friendly server in South Africa.
Cheers, Andy!
Andy Rabagliati wrote:
It will take me a week or so to get a good look at these - but - a question for the developers - am I right to only accept files matching ./en/[0-9a-f]/../* from the archive?
Presumably uploads are just hashed into these dirs?
Yes, that's correct. The directory name is derived from the MD5 hash of the filename.
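So the path for a given upload can be reproduced along these lines (a sketch; it assumes the usual first-one-and-two-hex-digit scheme, with spaces stored as underscores):
# reproduce the hashed path for an upload (sketch only; the example
# filename is made up, and spaces are assumed to be stored as underscores)
name="Example_image.jpg"
hash=$(printf '%s' "$name" | md5sum | cut -c1-32)
echo "en/${hash:0:1}/${hash:0:2}/${name}"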
There are a few pics that come with the mediawiki software that I would, naturally, leave alone.
In the first (Jun) archive /thumb/* was about 700Meg, and /archive/* was similar. There were also a lot of encyclopedia pics in the root dir - I threw them all away without noticing anything untoward.
In the real root directory there are symlinks to images in the other directories, apparently left there to avoid breaking URLs used in an earlier version of the software. Obviously tar has converted them from symlinks to duplicates. You can delete them.
I might run a script over the archive and convert large images to ones of the same size but, say, 70% quality. I imagine I could easily halve the archive size that way.
Quite likely.
If there are other regexes that would catch files resized by the server I would be very grateful for the hint.
The thumb directory contains all the images resized automatically, although the ./en/[0-9a-f] directories will contain some duplicate images resized by hand.
-- Tim Starling
On Sat, 16 Oct 2004 22:55:00 +1000, Tim Starling ts4294967296@hotmail.com wrote:
In the real root directory there are symlinks to images in the other directories, apparently left there to avoid breaking URLs used in an earlier version of the software. Obviously tar has converted them from symlinks to duplicates. You can delete them.
Perhaps tar could be told to --exclude these, so they don't take up everyone's space and bandwidth - from my glance at 'info tar', something like this might do it:
tar --no-wildcards-match-slash --exclude "*.*"
(I know it looks like "exclude everything", but I'm assuming none of the directories will have a . in the middle of their names, and that first argument should be pretty self-explanatory...)
Of course, for all I know, there's stuff in that root that *is* useful in the tar-ball, in which case the appropriate args may be a lot more complex.
On Sun, 17 Oct 2004 00:52:47 +0100, Rowan Collins rowan.collins@gmail.com wrote:
On Sat, 16 Oct 2004 22:55:00 +1000, Tim Starling ts4294967296@hotmail.com wrote:
In the real root directory there are symlinks to images in the other directories, apparently left there to avoid breaking URLs used in an earlier version of the software. Obviously tar has converted them from symlinks to duplicates. You can delete them.
Perhaps tar could be told to --exclude these, so they don't take up everyone's space and bandwidth - from my glance at 'info tar', something like this might do it:
tar --no-wildcards-match-slash --exclude "*.*"
Ahem! Better make that:
tar -cf $FILE_OUT $IMG_PATH --no-wildcards-match-slash --exclude "$IMG_PATH/*.*"
[probably ;)]
On Sun, 17 Oct 2004, Rowan Collins wrote:
Ahem! Better make that:
tar -cf $FILE_OUT $IMG_PATH --no-wildcards-match-slash --exclude "$IMG_PATH/*.*"
Actually, a simple
tar cf $FILE_OUT en/[0-9a-f]/
would work for me.
The tarball comes down from 7.5Gig to 5.3Gig with that.
Thanks again to Tim Starling for making this available.
There is some duplication of pictures in the archive, and I am not sure if orphaned pictures are also in this tarball - I think that they are.
Any chance of providing a list of orphaned pics so I can scrub them out too?
Cheers, Andy!