image dump

james

1 Jun 2005

6:37 p.m.

I just finished downloading the en image dump and tried extracting it with WinRAR, only to get an error: 'the archive is corrupt'. A different extractor does open it, but it doesn't preserve the folders, and there only seem to be 9200 images in the dump. There could well be more, as I haven't extracted it all because of the folder problem.

Does anyone know of an extractor that keeps the folders intact?

Once that is extracted, how will my wiki know where the images are for each article, and thus include them within each article?

thanks


Rowan Collins

1 Jun 2005

9:41 p.m.

On 01/06/05, james <jamessampford@supanet.com> wrote:

...

just finished downloading the en image dump, tried extracting it with winrar -

...

Does anyone know of an extractor that keeps the folders intact?

http://download.wikimedia.org/images/README_ABOUT_FILE_FORMAT.txt explains how to extract them correctly on a *NIX system, so one way would be to get Cygwin up and running. A quick search also turned up this Windows port of "libarchive", which seems like it may include a compatible "tar" utility: http://gnuwin32.sourceforge.net/packages/libarchive.htm
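For anyone without a compatible tar on Windows, Python's standard `tarfile` module can read pax archives, so a short script may be enough to unpack the dump with its folders intact. This is only a minimal sketch and not the method the README describes; the function name `extract_dump` and the file names in the usage line are made up for illustration.

```python
import shutil
import tarfile
from pathlib import Path


def extract_dump(archive_path, dest_dir):
    """Extract a tar/pax archive into dest_dir, keeping its folder layout.

    Returns the number of regular files written.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    count = 0
    # "r:*" lets tarfile detect the compression (none, gzip, bzip2) itself
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            target = (dest / member.name).resolve()
            # Skip entries that would land outside dest (for example "../x")
            if target != dest and dest not in target.parents:
                continue
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
            elif member.isfile():
                target.parent.mkdir(parents=True, exist_ok=True)
                with archive.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
                count += 1
    return count
```

Usage would look like `extract_dump("en_images.tar", "images")`, where both names are placeholders for wherever the download and the wiki's upload directory actually live.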

...

Once that is extracted, how will my wiki know where the images are for each article, and thus include them within each article?

For that, you need to download the "image" and "imagelinks" database tables from http://download.wikimedia.org/#en.wikipedia to go with your "cur" dump (the "imagelinks" one could probably be rebuilt programmatically, but that's likely to be slower than just downloading it).
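As background on why the folder structure matters: the sketch below assumes the dump uses MediaWiki's hashed upload layout, where a file sits under two directory levels taken from the MD5 of its name. That layout is not stated anywhere in this thread, so treat it as an assumption and compare the result against a few files from the extracted dump. The helper name `hashed_image_path` and the `root` default are hypothetical.

```python
import hashlib


def hashed_image_path(title, root="images"):
    """Guess the on-disk path of an image from its title.

    Assumes the hashed layout root/<h>/<hh>/<name>, where <h> and <hh>
    are the first one and two hex digits of md5(name).
    """
    # Assumed normalisation: spaces become underscores, first letter upper-cased
    name = title.replace(" ", "_")
    name = name[:1].upper() + name[1:]
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "/".join([root, digest[0], digest[:2], name])
```

If the extracted files do not sit at the paths this returns, the assumption does not hold for that dump and the "image" table remains the thing to trust.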

-- Rowan Collins BSc [IMSoP]

Kate Turner

2 Jun 2005

5:03 a.m.

Rowan Collins wrote in gmane.science.linguistics.wikipedia.technical:

...

On 01/06/05, james <jamessampford@supanet.com> wrote:

...
just finished downloading the en image dump, tried extracting it with winrar -

...

http://download.wikimedia.org/images/README_ABOUT_FILE_FORMAT.txt explains how to extract them correctly on a *NIX system, so one way would be to get Cygwin up and running.

hmm. i completely forgot that people might want to extract the images under non-Unix systems... :-( the pax format is standard, but it's not widely used. it's not the same as GNU's own tar format, although GNU tar can read pax archives. i guess most "multi-function" archive tools only understand the older POSIX (ustar) and GNU tar formats.
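To make the difference concrete, here is a small sketch of how a pax archive carries a name that does not fit the classic 100-byte name field: the member is preceded by an extended header (typeflag `x`) holding a `path` record. It uses Python's `tarfile` module as the writer and reader, which is an assumption of convenience and not the tool used to build the dump; `pax_long_name_demo` is a made-up name.

```python
import io
import tarfile


def pax_long_name_demo(name, payload=b"data"):
    """Write one member in pax format and report how its name was stored.

    Returns (typeflag of the first header block, the pax "path" record,
    the member name as seen by a reader).
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as archive:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    raw = buf.getvalue()

    # Byte 156 of a 512-byte tar header block is the typeflag
    first_typeflag = raw[156:157]

    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as archive:
        member = archive.getmembers()[0]
        return first_typeflag, member.pax_headers.get("path"), member.name
```

A tool that only knows the older formats will not interpret that `x` block as metadata, which is one plausible reason for truncated names or flattened folders on extraction.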

i don't have a Windows system here to test, but if someone wants to recommend an easy way to extract pax archives under Windows, i'll include it there. maybe we could distribute a standalone version of the Cygwin binary?

...

A quick search also turned up this Windows port of "libarchive", which seems like it may include a compatible "tar" utility: http://gnuwin32.sourceforge.net/packages/libarchive.htm

this looks fine. i've added a link to it in the readme for now.

kate.

Timwi

3 Jun 2005

4:19 a.m.

Kate Turner wrote:

...

hmm. i completely forgot that people might want to extract the images under non-Unix systems... :-( the pax format is standard, but it's not widely used

Why do you have to use it if it's so poorly supported?

Kate Turner

12:11 p.m.

Timwi wrote in gmane.science.linguistics.wikipedia.technical:

...

Kate Turner wrote:

...

...
the pax format is standard, but it's not widely used

...

Why do you have to use it if it's so poorly supported?

i was unable to find any documentation on the file format that GNU tar uses (see my previous messages to the list). Zip was suggested as an alternative, which is probably the most widely supported archive format. if pax turns out to be too unwieldy, it may be worth using that instead.

kate.

Timwi

3 p.m.

Kate Turner wrote:

...

i was unable to find any documentation on the file format that GNU tar uses

http://www.gnu.org/software/tar/manual/html_mono/tar.html#SEC134 ?

Kate Turner

3:25 p.m.

Timwi wrote in gmane.science.linguistics.wikipedia.technical:

...

Kate Turner wrote:

...

...
i was unable to find any documentation on the file format that GNU tar uses

...

http://www.gnu.org/software/tar/manual/html_mono/tar.html#SEC134 ?

this is what i looked at before, but i can't find the relevant part of the description. it says:

/* Identifies the *next* file on the tape as having a long name. */
#define GNUTYPE_LONGNAME 'L'

but does not indicate how the long name should be encoded in the archive header, unless i'm missing it somewhere...
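One way to look at the encoding without documentation is to inspect what an existing GNU-format writer emits. The sketch below uses Python's `tarfile` module as that writer, so it shows Python's rendering of the GNU format and not GNU tar's own output; that substitution is an assumption, as is the helper name `gnu_longname_layout`. With a name longer than 100 bytes, the archive should start with a pseudo-header of typeflag `L` whose data blocks hold the NUL-terminated name, followed by the member's regular header.

```python
import io
import tarfile

BLOCK = 512  # tar archives are laid out in 512-byte blocks


def gnu_longname_layout(name, payload=b"data"):
    """Write one member in GNU format and return the long-name record fields."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as archive:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    raw = buf.getvalue()

    header = raw[:BLOCK]
    # The size field is octal ASCII at bytes 124..135 of the header
    size = int(header[124:136].rstrip(b"\0 ").decode("ascii"), 8)
    return {
        "pseudo_name": header[:100].rstrip(b"\0").decode("ascii"),
        "typeflag": header[156:157],
        "size": size,
        # The name itself is stored as file data right after the header
        "stored_name": raw[BLOCK:BLOCK + size],
    }
```

Calling it with a name of, say, 150 characters and printing the returned dictionary shows each field of the pseudo-header next to the name bytes that follow it.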

kate.

(interestingly, the manual says that GNU tar will use pax format by default in the future, although i suppose that does not solve the immediate problem ;-)

wikitech-l@lists.wikimedia.org

6 comments

4 participants

tags (0)

participants (4)

james
Kate Turner
Rowan Collins
Timwi