I just finished downloading the en image dump and tried extracting it with WinRAR, only to get an error: 'the archive is corrupt'. A different extractor does open it, but it doesn't preserve the folders, and there only seem to be 9200 images in the dump. There could well be more, as I haven't extracted it all because of this folder problem.
Does anyone know of an extractor that keeps the folders intact?
Once that is extracted, how will my wiki know where the images are for each article, and thus include them within each article?
thanks
On 01/06/05, james jamessampford@supanet.com wrote:
just finished downloading the en image dump, tried extracting it with winrar -
Does anyone know of an extractor that keeps the folders intact?
http://download.wikimedia.org/images/README_ABOUT_FILE_FORMAT.txt mentions how to extract them correctly on *NIX systems - so one way would be to get cygwin up and running. A quick search also turned up this Windows port of "libarchive", which seems like it may include a compatible "tar" utility - http://gnuwin32.sourceforge.net/packages/libarchive.htm
Once that is extracted, how will my wiki know where the images are for each article, and thus include them within each article?
For that, you need to download the "image" and "imagelinks" database tables from http://download.wikimedia.org/#en.wikipedia to go with your "cur" dump (the "imagelinks" one could probably be rebuilt programmatically, but that's likely to be slower than just downloading it).
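As an aside on why the folder layout matters: as far as I know, MediaWiki with hashed upload directories looks for each file under two folder levels derived from the MD5 hash of its name, so flattening the folders breaks the lookup. A minimal sketch of that scheme, assuming the dump mirrors the upload directory (`hashed_image_path` is a hypothetical helper name, not a MediaWiki function):

```python
import hashlib


def hashed_image_path(title):
    """Return the relative path where a hashed upload directory would hold `title`.

    Assumption: the name is normalised the way MediaWiki stores titles
    (spaces become underscores, first letter upper-cased) before hashing.
    """
    name = title.replace(" ", "_")
    name = name[:1].upper() + name[1:]
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # first hex digit, then first two hex digits, then the file name
    return "/".join((digest[0], digest[:2], name))


print(hashed_image_path("example image.jpg"))
```

The "image" table then supplies the metadata for each file and "imagelinks" records which articles use it.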
Rowan Collins wrote in gmane.science.linguistics.wikipedia.technical:
On 01/06/05, james jamessampford@supanet.com wrote:
just finished downloading the en image dump, tried extracting it with winrar -
http://download.wikimedia.org/images/README_ABOUT_FILE_FORMAT.txt mentions how to extract them correctly on *NIX systems - so one way would be to get cygwin up and running.
hmm. i completely forgot that people might want to extract the images under non-Unix systems... :-( the pax format is standard, but it's not widely used. it's not the same as GNU's own format, although GNU tar can read pax archives. i guess most "multi-function" archive tools only understand the older POSIX and GNU tar formats.
i don't have a Windows system here to test, but if someone wants to recommend an easy way to extract pax archives under Windows, i'll include it there. maybe we could distribute a standalone version of the Cygwin binary?
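One cross-platform possibility, offered as an untested-on-Windows sketch rather than a recommendation from the readme: Python's standard `tarfile` module can read pax archives and recreates the directory tree on extraction. The function name and the archive and folder names below are placeholders, not the real dump file names.

```python
import os
import tarfile


def extract_image_dump(archive_path, dest_dir):
    """Extract a tar/pax archive into dest_dir, keeping its folder layout.

    Returns the number of regular files extracted.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    count = 0
    # mode "r:*" lets tarfile detect any compression by itself
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            # only plain files and directories are expected in an image dump
            if not (member.isfile() or member.isdir()):
                continue
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # skip anything that would land outside dest_dir
            if target != dest_root and not target.startswith(dest_root + os.sep):
                continue
            archive.extract(member, dest_root)
            if member.isfile():
                count += 1
    return count


if __name__ == "__main__" and os.path.exists("image_dump.tar"):
    print(extract_image_dump("image_dump.tar", "images"), "files extracted")
```

Because the long file names live in pax extended headers, a tool that only understands the older formats may report the archive as corrupt or lose the paths, which would match the symptoms described above.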
A quick search also turned up this Windows port of "libarchive", which seems like it may include a compatible "tar" utility - http://gnuwin32.sourceforge.net/packages/libarchive.htm
this looks fine. i've added a link to it in the readme for now.
kate.
Timwi wrote in gmane.science.linguistics.wikipedia.technical:
Kate Turner wrote:
the pax format is standard, but it's not widely used
Why do you have to use it if it's so poorly supported?
i was unable to find any documentation on the file format that GNU tar uses (see my previous messages to the list). Zip was suggested as an alternative, which is probably the most widely supported archive format. if pax turns out to be too unwieldy, it may be worth using that instead.
kate.
Kate Turner wrote:
i was unable to find any documentation on the file format that GNU tar uses
http://www.gnu.org/software/tar/manual/html_mono/tar.html#SEC134 ?
Timwi wrote in gmane.science.linguistics.wikipedia.technical:
Kate Turner wrote:
i was unable to find any documentation on the file format that GNU tar uses
http://www.gnu.org/software/tar/manual/html_mono/tar.html#SEC134 ?
this is what i looked at before, but i can't find the relevant part of the description. it says:
    /* Identifies the *next* file on the tape as having a long name. */
    #define GNUTYPE_LONGNAME 'L'
but does not indicate how the long name should be encoded in the archive header, unless i'm missing it somewhere...
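For what it's worth, the encoding can be observed empirically. The sketch below is an illustration, not taken from the manual: it writes a one-member archive with Python's `tarfile` module in its GNU mode and prints the relevant header fields. My understanding is that GNU tar itself produces the same layout, but treat that as an assumption. `long_name_blocks` is a hypothetical helper; the offsets 156 (type flag) and 124 (size) are the standard tar header field positions.

```python
import io
import tarfile

BLOCK = 512  # tar archives are sequences of 512-byte blocks


def long_name_blocks(name):
    """Write a one-member GNU-format archive in memory; return its first three blocks."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as archive:
        archive.addfile(tarfile.TarInfo(name))  # an empty regular file
    raw = buf.getvalue()
    return raw[:BLOCK], raw[BLOCK:2 * BLOCK], raw[2 * BLOCK:3 * BLOCK]


# 131 bytes, which does not fit in the 100-byte name field of a tar header
name = "images/" + "x" * 120 + ".jpg"
extra_header, name_data, real_header = long_name_blocks(name)

print(extra_header[156:157])             # type flag of the extra header
print(extra_header[:100].rstrip(b"\0"))  # placeholder name in the extra header
print(int(extra_header[124:135], 8))     # size field of the extra header
print(name_data.rstrip(b"\0").decode())  # data block following the extra header
print(real_header[156:157])              # type flag of the real member's header
```

In this output the extra header carries type 'L' and a placeholder name, its size field is the length of the long name plus a terminating NUL, and the long name itself is stored as ordinary file data in the blocks that follow. The header for the actual file comes after that, with the name field truncated to 100 bytes.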
kate.
(interestingly, the manual says that GNU tar will use pax format by default in the future, although i suppose that does not solve the immediate problem ;-)
wikitech-l@lists.wikimedia.org