Hi all;
Yesterday, I wrote a post[1] with some links to current dumps, old dumps, and other raw data such as Domas' visit logs, plus some links to the Internet Archive where we can download some historical dumps. Please, can you share your links?
Also, what about making a tarball of thumbnails from Commons? 800x600 would be a nice (re)solution, to avoid a multi-terabyte dump. Otherwise, an image dump will probably never be published. Commons is growing by ~5000 images per day. It is scary.
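Something along these lines could build such a tarball from a local copy of the originals. This is only a rough sketch: the originals/ and thumbs/ directories and the tarball name are made-up examples, and it assumes ImageMagick is installed.

#!/bin/sh
# Rough sketch: shrink local copies of Commons originals to fit within
# 800x600 and bundle them. Directory names are made-up examples.
mkdir -p thumbs
for f in originals/*; do
    # ">" only shrinks images larger than 800x600; it never enlarges them
    convert "$f" -resize '800x600>' "thumbs/$(basename "$f")"
done
tar -czf commons-thumbs-800x600.tar.gz thumbs/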
Regards, emijrp
Hi again;
Today I discovered the early design of the Wikipedia database download page, available in the Internet Archive.[1]
Surprisingly, you can still download some of the dumps, though of course only the tiniest ones. For example, the old table dump for cs.wikipedia.org,[2] which is 595 KB compressed and expands to 4.1 MB.
They are not very "usable", only for data hoarders ;) and nostalgia. A piece of (Internet|Wikipedia) history.
Regards, emijrp
[1] http://web.archive.org/web/20031213055854/http://download.wikimedia.org/
[2] http://web.archive.org/web/20031213055854/http://download.wikimedia.org/arch...
Hi,
Yes, publicly available tarballs of image dumps would be great. Here's what I think it would take to implement:
1. Allocate the server space for the image tarballs.
2. Allocate the bandwidth for us to download them.
3. Decide what tarballs will be made available (i.e. separated by wiki or the whole of Commons, thumbnails or 800x600 max, etc.).
4. Write the script(s) for collecting the image lists, automating the image scaling and creating the tarballs (a rough sketch of the list-collection part follows below).
5. Done!
None of those tasks are really that difficult; the hard part is figuring out why image tarballs used to be available but aren't anymore, especially when there is apparently adequate server space and bandwidth. I guess it is one more thing that could break, and then people would complain about it not working.
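As a rough illustration of the list-collection part of step 4, one way (not necessarily what is meant above) is the standard MediaWiki API. The output file name and batch size below are just placeholders, and paging through the full list with the API's continuation parameter is left out.

#!/bin/sh
# Sketch only: fetch one batch of file names from Commons via the MediaWiki API.
# A real run would follow the API's continuation parameter to cover the whole list.
curl -s 'https://commons.wikimedia.org/w/api.php?action=query&list=allimages&ailimit=500&format=json' \
    | jq -r '.query.allimages[].name' > imagelist.txt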
cheers, Jamie
Images take up 8 TB or more these days (of course that includes deleted files and earlier versions, but those aren't the bulk of it). Hosting 8 TB tarballs seems out of the question... who would download them anyway?
Having said that, hosting small subsets of images is quite possible and is something that has been discussed in the past. I would love to hear which subsets of images people want and would actually use.
Ariel
Hi,
There is the script wikix that people have used to manually download images from wikis:
http://meta.wikimedia.org/wiki/Wikix
It generates a list of all the images in an XML dump and then downloads them. The only thing missing is the image scaling; without that, the enwiki image dump will be too large for most people to use right now. ImageMagick (http://en.wikipedia.org/wiki/ImageMagick) could be used to scale the various image formats down to smaller sizes.
Here's a bash script snippet I found that uses it:
#!/bin/sh
# Resize each PNG found under the music directory and drop the result
# next to it as cover.bmp (paths are from the original snippet).
find /media/SHAWN\ IPOD/Songs/ -iname "*.png" | while read -r file; do
    convert -size 75x75 "$file" -resize 100x100 cover.bmp
    cp cover.bmp "${file%/*}"/.
done
If the Wikimedia Foundation provides a dump of images, I think people will find interesting ways to use them. Dumps of enwiki images with a maximum size of 640x480 or 800x600, and also enwiki thumbnails, are the two subsets I think would be most valuable.
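For the 800x600 cap, something along these lines could shrink a downloaded set in place with ImageMagick. This is only a sketch; the ./images directory is just an assumed location for the files fetched by wikix.

#!/bin/sh
# Sketch only: shrink every image under ./images to fit within 800x600.
# mogrify overwrites files in place; ">" never enlarges smaller images.
find ./images -type f \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \) -print0 \
    | xargs -0 -n 100 mogrify -resize '800x600>'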
cheers, Jamie