Xmldatadumps-l April 2012

xmldatadumps-l@lists.wikimedia.org

14 participants
10 discussions

pbzip2 proposal
by Richard Jelinek 16 Jan '16

Hi,

I don't know if this issue has come up already; if it has and was dismissed, I beg your pardon. If it hasn't...

I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the XML dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug that bunzip2 doesn't have. :-) Strange? Read on.

A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results obtained a few hours before that. The results indicate that bzip2 and pbzip2 are mutually compatible: each can create archives that the other can read. When it comes to uncompressing, however, only pbzip2-compressed archives are good for pbunzip2.

I propose compressing the archives with pbzip2 for the following reasons:

1) If your archiving machines are SMP systems, this could lead to better usage of system resources (i.e. faster compression).

2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run as usual for these people.

3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.

So to sum up: it's a no-lose, two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)

cheers,

--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Register court: AG Fürth, HRB-9201
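To illustrate why such archives stay readable by ordinary bzip2 tools, here is a minimal Python sketch of the general idea behind parallel bzip2 compression: the input is cut into chunks, each chunk is compressed as an independent bzip2 stream, and the streams are concatenated. This is not pbzip2's actual code; the function names, the chunk size, and the worker count are assumptions made for the example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Hypothetical chunk size, chosen only for this illustration.
CHUNK_SIZE = 900_000


def compress_chunks(data: bytes, chunk_size: int = CHUNK_SIZE, workers: int = 4) -> List[bytes]:
    """Compress each chunk of `data` as an independent bzip2 stream, in parallel."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order, so the streams stay in sequence.
        return list(pool.map(bz2.compress, chunks))


def decompress_streams(streams: List[bytes], workers: int = 4) -> bytes:
    """Decompress independent bzip2 streams in parallel and join the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(bz2.decompress, streams))


if __name__ == "__main__":
    payload = b"<page>example revision text</page>\n" * 100_000
    streams = compress_chunks(payload)
    archive = b"".join(streams)  # concatenated streams form one multi-stream file

    # A serial reader sees a normal multi-stream bzip2 file.
    print(bz2.decompress(archive) == payload)
    # A parallel reader can process the streams independently.
    print(decompress_streams(streams) == payload)
```

In this sketch the parallel reader is handed the list of streams directly; a real tool has to locate the stream boundaries inside the archive itself, which is where files written by different compressors can behave differently.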

6 11

Re: [Xmldatadumps-l] things look ok after 1.19, so...
by Andreas Meier 15 May '12

Hello,

there seem to be only two worker processes for bigger wikis instead of three as before.

Best regards
Andreas

2 5

Picture of the year
by burslem 08 May '12

It's amazing that so many years are already available for download. The larger zips in particular must have been time-consuming to compile! It would be great if 2008 or pre-2006 packages became available in the near future.

It is really interesting to see http://dumps.wikimedia.org taking an interest not only in compiling large, current packaged databases of the various wikis, but also in more historic content. In the past, the Internet Archive was the sole distributor of older (also historic) wiki packages.

Eventually, in the far future, there will have to be some viable mechanism for cloning all the images stored on Wikimedia; for now, though, the Picture of the Year packages are very interesting for those more interested in the pretty images of Wikipedia. The POTY images also make great wallpaper packs!

4 3

…-image.sql.gz metadata dumps get truncated
by Bastian Koell 07 May '12

Hello everyone,

I was working on a Wikipedia reader when I noticed this little issue. The data in the image metadata dumps (e.g. enwiki-20120403-image.sql.gz) gets truncated. This shows up in the img_description column, which is defined as tinyblob. Tinyblobs hold 255 bytes at most.

I'd really love to use this dump instead of straining the servers and taking forever. Is this my fault, or can you do something to address this issue? Most interesting for me would be Commons, of course, then the German, French and Spanish Wikipedias.

Best from Berlin,
Bastian

Please see the column definition:

    `img_description` tinyblob NOT NULL

And the table structure:

    CREATE TABLE `image` (
      `img_name` varbinary(255) NOT NULL DEFAULT '',
      `img_size` int(8) unsigned NOT NULL DEFAULT '0',
      `img_width` int(5) NOT NULL DEFAULT '0',
      `img_height` int(5) NOT NULL DEFAULT '0',
      `img_metadata` mediumblob NOT NULL,
      `img_bits` int(3) NOT NULL DEFAULT '0',
      `img_media_type` enum('UNKNOWN','BITMAP','DRAWING','AUDIO','VIDEO','MULTIMEDIA','OFFICE','TEXT','EXECUTABLE','ARCHIVE') DEFAULT NULL,
      `img_major_mime` enum('unknown','application','audio','image','text','video','message','model','multipart') NOT NULL DEFAULT 'unknown',
      `img_minor_mime` varbinary(32) NOT NULL DEFAULT 'unknown',
      `img_description` tinyblob NOT NULL,
      `img_user` int(5) unsigned NOT NULL DEFAULT '0',
      `img_user_text` varbinary(255) NOT NULL DEFAULT '',
      `img_timestamp` varbinary(14) NOT NULL DEFAULT '',
      `img_sha1` varbinary(32) NOT NULL DEFAULT '',
      PRIMARY KEY (`img_name`),
      KEY `img_size` (`img_size`),
      KEY `img_timestamp` (`img_timestamp`),
      KEY `img_usertext_timestamp` (`img_user_text`,`img_timestamp`),
      KEY `img_sha1` (`img_sha1`)
    ) ENGINE=InnoDB DEFAULT CHARSET=binary;
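As a rough aid for anyone reading the dump, the following Python sketch flags img_description values that sit at the 255-byte tinyblob limit and may therefore have been cut off. The helper names and the (img_name, img_description) row layout are assumptions for illustration; a value that is exactly 255 bytes long is only a hint of truncation, not proof.

```python
from typing import Iterable, List, Tuple

# Maximum length of a MySQL TINYBLOB value, in bytes.
TINYBLOB_MAX_BYTES = 255


def possibly_truncated(description: bytes) -> bool:
    """Return True if the value fills the whole tinyblob and may be cut off."""
    return len(description) >= TINYBLOB_MAX_BYTES


def suspicious_rows(rows: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Return the image names whose description may have been truncated.

    `rows` is a hypothetical iterable of (img_name, img_description) pairs
    that the caller has already parsed out of the SQL dump.
    """
    return [name for name, description in rows if possibly_truncated(description)]


if __name__ == "__main__":
    sample = [
        ("Short.jpg", b"A short description"),
        ("Long.jpg", b"x" * 255),
    ]
    print(suspicious_rows(sample))
```

Images flagged this way are the ones whose full description would still have to be fetched from another source.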

4 4

Old dump for Wikipedia (August 8th, 2008)
by Nicolai Erbs 27 Apr '12

Dear all, I'm looking for an old Wikipedia dump (August 8th, 2008). Any ideas where I can get it? Thanks in advance! Nicolai

5 8

Question about enwiki pages-meta-history splits
by Napolitano, Diane 25 Apr '12

Hello, I was wondering how the decision is reached to split enwiki pages-meta-history into, say, N XML files. How is N determined? Is it based on something like "let's try to have X many pages per XML file" or "Y many revisions per XML file" or trying to keep the size (GB) of each XML file roughly equivalent? Or is N just an arbitrary number chosen because it sounds nice? :) Thanks, Diane

2 4

issues on wikis running 1.20wmf1
by Ariel T. Glenn 20 Apr '12

I'm seeing some failures to retrieve certain revision texts for wikis now running MediaWiki 1.20wmf1. The problem is being investigated. Ariel

1 1

Re: [Xmldatadumps-l] uploaded media for WMF projects available via rsync
by Alex Buie 06 Apr '12

Hey guys,

Sorry for breaking the thread, but I just subscribed, so I think this'll probably break mailman's threading headers.

This is very exciting news, and IA would love to have a copy! We're more interested in being a historical mirror (on our item infrastructure), rather than a live rsync/http/ftp mirror, but perhaps we can also work something out mirroring the latest dumps. (How big are the last 2 or so?)

I suppose the next step is for me and Ariel to talk about technical procedures and details, et cetera, but I just wanted to subscribe to this ml and introduce myself. Ariel, when you have a minute to chat, shoot me an email (or skype).

I'm thinking we just pull things at whatever frequency you guys push out the data to your.org (which may or may not be scheduled yet) and throw them into new items on the cluster.

Others' thoughts are, of course, always welcome.

Thanks!

Alex Buie
Collections Group
Internet Archive, a registered California non-profit library
abuie(a)archive.org

6 13

Re: [Xmldatadumps-l] Fwd: I DID IT!!!
by andreasmeier80＠gmx.de 03 Apr '12

Online success awaits! Discover how you can earn mega profits http://mgolebatmaz.av.tr/currentevents/54GaryWallace/

1 0

uploaded media for WMF projects available via rsync
by Ariel T. Glenn 02 Apr '12

This is phase one of a plan to make uploaded media from WMF projects accessible for download in bulk. It, like many other things lately, is experimental and subject to breakage, change, etc.

First, a big thanks to Kevin Day from Your.org, who offered us the space and worked with us many hours to sort out networking issues, try different NAS setups, and generally do what was needed to get this going.

Rsync url: ftpmirror.your.org::wikimedia-images/projectname/languagecode

For example:

    rsync -a ftpmirror.your.org::wikimedia-images/wikipedia/commons /my/dir

would get you all of commons including archived versions (no deleted images of course).

Folks who are trying to download media for a specific project should bear in mind that they will need the files not only from that project but also those which are hosted on commons and used on the local project. I'm looking into producing lists of those files for easy use by rsyncers.

I would suggest that, rather than everyone downloading a zillion copies of commons at once, folks coordinate a little bit, or just get the pieces they need :-D

The data that is there now is probably about 15-20 days old. It will likely be a little while before I get the media rsync going on a regular basis; I'm juggling a lot of pieces right now.

Ariel

P.S. This is not an April fools joke, it's April 2 here already :-P
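For those scripting their downloads, here is a small Python sketch that assembles the rsync invocation from the URL pattern given above (host ftpmirror.your.org, module wikimedia-images, then projectname/languagecode). The function name and the destination path are assumptions for illustration; the sketch only builds the argument list and does not start a transfer.

```python
from typing import List

# Host and module taken from the rsync URL in the announcement.
RSYNC_HOST = "ftpmirror.your.org"
RSYNC_MODULE = "wikimedia-images"


def build_rsync_command(project: str, language: str, dest: str) -> List[str]:
    """Build the rsync argument list for one project/language pair."""
    source = f"{RSYNC_HOST}::{RSYNC_MODULE}/{project}/{language}"
    # -a (archive mode) is the only option used in the announcement's example.
    return ["rsync", "-a", source, dest]


if __name__ == "__main__":
    # Prints the command; pass the list to subprocess.run() to actually execute it.
    print(" ".join(build_rsync_command("wikipedia", "commons", "/my/dir")))
```

Building the command per project and language code makes it easy to fetch only the pieces needed rather than a full copy of commons.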

3 4
