I don't know if this issue has come up already. If it has and was
dismissed, I beg your pardon. In case it hasn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible: each one can create
archives that the other one can read. But when it comes to
uncompressing, only pbzip2-compressed archives are good for pbunzip2.
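
A minimal sketch of why both directions can work, assuming (as far as I
understand the format) that pbzip2 compresses blocks of the input
independently and writes them out as concatenated bzip2 streams. Such a
file is still valid bzip2 input, and a parallel decompressor can hand
each stream to a different CPU. The function names and the chunk size
below are made up for illustration; this uses Python's bz2 module, not
pbzip2 itself.

```python
import bz2


def compress_in_chunks(data: bytes, chunk_size: int) -> bytes:
    """Compress each chunk as its own bzip2 stream and concatenate them.

    Illustrative stand-in for a parallel compressor; here the chunks are
    compressed one after another rather than on separate CPUs.
    """
    streams = []
    for start in range(0, len(data), chunk_size):
        streams.append(bz2.compress(data[start:start + chunk_size]))
    return b"".join(streams)


def count_streams(archive: bytes) -> int:
    """Count the independent bzip2 streams contained in an archive."""
    count = 0
    remaining = archive
    while remaining:
        decompressor = bz2.BZ2Decompressor()
        decompressor.decompress(remaining)
        # Bytes after the end of the current stream belong to the next one.
        remaining = decompressor.unused_data
        count += 1
    return count


if __name__ == "__main__":
    payload = b"wikipedia dump " * 1000
    multi = compress_in_chunks(payload, 4096)
    single = bz2.compress(payload)
    # A regular decompressor reads the multi-stream archive as one file.
    print(bz2.decompress(multi) == payload)
    print(count_streams(multi), count_streams(single))
```

An archive written as a single stream offers no such independent units,
which would be consistent with pbunzip2 only doing well on archives that
pbzip2 created.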
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
It's amazing there are already so many years available for download.
Especially the larger zips must have been somewhat time-consuming to
compile! It would be great if 2008 or pre-2006 packages became available in
the near future. It is a really interesting development to see
http://dumps.wikimedia.org being interested in not only compiling large
current packaged databases of various wikis, but also more historic
content. In the past, the Internet Archive was the sole distributor of
older (also historic) wiki packages.
Eventually, in the far future, there will have to be some sort of viable
mechanism for cloning all the images stored on Wikimedia, though for now
the Picture of the Year packages are very interesting for those more
interested in the pretty images of Wikipedia. The POTY images also make
great wallpaper packs!
I was just working on a Wikipedia reader when I noticed this little issue.
The data in the image metadata dumps (e.g. enwiki-20120403-image.sql.gz) gets somewhat truncated.
This appears to be caused by the img_description column being defined as tinyblob. Tinyblobs apparently hold 255 bytes at most.
I'd really love to use this dump instead of straining the servers... and taking forever.
Is this my fault or can you do something to address this issue?
Most interesting for me would be Commons of course, then the German, French and Spanish Wikipedias.
Best from Berlin,
Please see the column definition:
`img_description` tinyblob NOT NULL
And the table structure:
CREATE TABLE `image` (
`img_name` varbinary(255) NOT NULL DEFAULT '',
`img_size` int(8) unsigned NOT NULL DEFAULT '0',
`img_width` int(5) NOT NULL DEFAULT '0',
`img_height` int(5) NOT NULL DEFAULT '0',
`img_metadata` mediumblob NOT NULL,
`img_bits` int(3) NOT NULL DEFAULT '0',
`img_media_type` enum('UNKNOWN','BITMAP','DRAWING','AUDIO','VIDEO','MULTIMEDIA','OFFICE','TEXT','EXECUTABLE','ARCHIVE') DEFAULT NULL,
`img_major_mime` enum('unknown','application','audio','image','text','video','message','model','multipart') NOT NULL DEFAULT 'unknown',
`img_minor_mime` varbinary(32) NOT NULL DEFAULT 'unknown',
`img_description` tinyblob NOT NULL,
`img_user` int(5) unsigned NOT NULL DEFAULT '0',
`img_user_text` varbinary(255) NOT NULL DEFAULT '',
`img_timestamp` varbinary(14) NOT NULL DEFAULT '',
`img_sha1` varbinary(32) NOT NULL DEFAULT '',
PRIMARY KEY (`img_name`),
KEY `img_size` (`img_size`),
KEY `img_timestamp` (`img_timestamp`),
KEY `img_usertext_timestamp` (`img_user_text`,`img_timestamp`),
KEY `img_sha1` (`img_sha1`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;
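
To illustrate the effect on a reader application, here is a small
sketch, assuming the descriptions are UTF-8 text cut at the 255-byte
TINYBLOB limit. The helper names are hypothetical, and the check is only
a heuristic: a value of exactly 255 bytes may also be a description that
happened to fit.

```python
TINYBLOB_MAX_BYTES = 255  # maximum length of a MySQL TINYBLOB value


def truncate_like_tinyblob(text: str) -> bytes:
    """Simulate storing a description in a column limited to 255 bytes."""
    return text.encode("utf-8")[:TINYBLOB_MAX_BYTES]


def looks_truncated(value: bytes) -> bool:
    """Heuristic: a value that fills the whole column was probably cut off."""
    return len(value) >= TINYBLOB_MAX_BYTES


def decode_description(value: bytes) -> str:
    """Decode a stored description, dropping a multi-byte character that
    was split by the byte-level cut."""
    return value.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    stored = truncate_like_tinyblob("ä" * 200)  # 400 bytes before the cut
    print(len(stored), looks_truncated(stored))
    print(len(decode_description(stored)))
```

Because the cut happens at a byte boundary, the last character of a
non-ASCII description may be incomplete, so a reader should decode
defensively as above.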
Hello, I was wondering how the decision is reached to split enwiki pages-meta-history into, say, N XML files. How is N determined? Is it based on something like "let's try to have X many pages per XML file" or "Y many revisions per XML file" or trying to keep the size (GB) of each XML file roughly equivalent? Or is N just an arbitrary number chosen because it sounds nice? :)
Sorry for breaking the thread, but I just subscribed, so I think
this'll probably break mailman's threading headers.
This is very exciting news, and IA would love to have a copy! We're
more interested in being a historical mirror (on our item
infrastructure), rather than a live rsync/http/ftp mirror, but perhaps
we can also work something out mirroring the latest dumps. (How big
are the last 2 or so?)
I suppose the next step is for me and Ariel to talk about technical
procedures and details, et cetera, but I just wanted to subscribe to
this ml and introduce myself.
Ariel, when you have a minute to chat, shoot me an email (or skype).
I'm thinking we just pull things at whatever frequency you guys push
out the data to your.org (which may or may not be scheduled yet) and
throw them into new items on the cluster.
Others' thoughts are, of course, always welcome.
Internet Archive, a registered California non-profit library
This is phase one of a plan to make uploaded media from WMF projects
accessible for download in bulk. It, like many other things lately, is
experimental and subject to breakage, change, etc.
First, a big thanks to Kevin Day from Your.org who offered us the space
and worked with us many hours to sort out networking issues, try
different NAS setups, and generally do what was needed to get this
Rsync url: ftpmirror.your.org::wikimedia-images/projectname/languagecode
rsync -a ftpmirror.your.org::wikimedia-images/wikipedia/commons /my/dir
would get you all of commons including archived versions (no deleted
images of course).
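
For anyone scripting this, a small sketch of how the command could be
assembled from the URL pattern above. The helper name is made up, and
the only rsync option used is the -a from the example; actually running
the command is left to the caller.

```python
MIRROR_HOST = "ftpmirror.your.org"
RSYNC_MODULE = "wikimedia-images"


def media_rsync_command(project: str, language: str, dest: str) -> list[str]:
    """Build the rsync argument list for one project/language directory.

    Follows the pattern host::module/projectname/languagecode.
    """
    source = f"{MIRROR_HOST}::{RSYNC_MODULE}/{project}/{language}"
    return ["rsync", "-a", source, dest]


if __name__ == "__main__":
    # Same as the commons example; pass the list to subprocess.run to run it.
    print(" ".join(media_rsync_command("wikipedia", "commons", "/my/dir")))
```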
Folks who are trying to download media for a specific project should
bear in mind that they will need the files not only from that project
but also those which are hosted on commons and used on the local
project. I'm looking into producing lists of those files for easy use
I would suggest rather than everyone downloading a zillion copies of
commons at once, that folks coordinate a little bit, or just get the
pieces they need :-D
The data that is there now is probably about 15-20 days old. It will
likely be a little while before I get the media rsync going on a regular
basis; I'm juggling a lot of pieces right now.
P.S. This is not an April Fools' joke, it's April 2 here already :-P