Hi,
I don't know if this issue has come up already - in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results gathered a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible when
compressing: each one can create archives the other can read. But when
it comes to decompressing, only pbzip2-compressed archives are handled
correctly by pbunzip2.
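For the record, the round-trip check behind those results can be
scripted roughly like this (a sketch only; it assumes both tools are on
the PATH, and the filenames are made up):

    <?php
    // Cross-compatibility matrix: compress with each tool, decompress
    // with each tool, then compare checksums of the round-tripped file.
    $src = 'sample.xml';
    foreach ( array( 'bzip2', 'pbzip2' ) as $comp ) {
        foreach ( array( 'bunzip2', 'pbunzip2' ) as $decomp ) {
            shell_exec( "$comp -c " . escapeshellarg( $src ) . " > test.bz2" );
            shell_exec( "$decomp -c test.bz2 > roundtrip.xml" );
            $ok = ( md5_file( $src ) === md5_file( 'roundtrip.xml' ) );
            echo "$comp -> $decomp: " . ( $ok ? 'OK' : 'MISMATCH' ) . "\n";
        }
    }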
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should keep working for those people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And all that just because pbunzip2 is slightly buggy. Isn't
that interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com      Managing Director: Richard Jelinek
Human Language Technology Experts   Registered office: Fürth
69216618 Mind Units                 Commercial register: AG Fürth, HRB-9201
It's amazing there are already so many years available for download.
Especially the larger zips must have been quite time-consuming to
compile! It would be great if the 2008 or pre-2006 packages became
available in the near future. It is a really interesting development to
see http://dumps.wikimedia.org compiling not only large packaged
databases of the various wikis' current content, but also more historic
content. In the past, the Internet Archive was the sole distributor of
older (historic) wiki packages.
Eventually, in the far future, there will have to be some sort of
viable mechanism for cloning all the images stored on Wikimedia, though
for now the Picture of the Year packages are very interesting for those
more drawn to the pretty images of Wikipedia. The POTY images also make
great wallpaper packs!
As usual there have been a couple of hiccups after the rollout of
MediaWiki 1.19wmf1 everywhere. In the expectation that there may be
other surprises, I'm running just one process over the weekend and will
check on it from time to time. Of course, if anyone notices something
awry, please bring it up.
Thanks,
Ariel
Hello,
the dewiki dump 20120221 is completely broken; see
http://dumps.wikimedia.org/dewiki/20120221/ . In my opinion, the best
thing to do is to stop this dump process and start a fresh dump. I
think it makes no sense to restart the 20120221 dump.
Best regards
Andreas
Hello,
while working on the xmldumps-backup test suite, I came across ExternalStorageHttp.
While it's easy to set up (read-only), testing it is rather a burden.
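For context, the read-only side amounts to something like this in
LocalSettings.php (a sketch under my assumptions; the example URL is
invented):

    # Enable the 'http' protocol for external revision text storage.
    $wgExternalStores = array( 'http' );
    # A text row then carries 'external' in old_flags and a URL in
    # old_text, e.g. 'http://textstore.example.org/fetch?id=12345';
    # on read, ExternalStorageHttp simply fetches that URL over HTTP.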
Is ExternalStorageHttp actually used for some MediaWiki installation at Wikimedia?
Kind regards,
Christian
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasse 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hello,
I am currently developing a test suite for the XML dumps, and I am curious about the specification of text.old_flags in MediaWiki's maintenance/tables.sql.
The file describes the 'object' flag as

    text field contained a serialized PHP object.
    object either contains multiple versions compressed
    to achieve a better compression ratio, or it refers
    to another row where the text can be found.
Is the "multiple versions" part still used in some project?
If so, how should this be set up [1]?
Kind regards,
Christian
P.S.: In #wikimedia-dev I was told to bring up the question on this list. If there are other lists where I should ask, please let me know.
[1] Before r6138 (back then still in Article.php, not Revision.php), it seems the text was obtained by

    $object = unserialize( $text );
    $text = $object->getItem( $hash );

There it is somewhat obvious how a single object may return different texts. However, beginning with r6138, it seems the text is simply fetched by

    $obj = unserialize( $text );
    [...]
    $text = $obj->getText();

If a single object is supposed to return different texts, how does it determine which text to return?
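To make the question concrete, here is a toy class (purely
illustrative, not MediaWiki's actual HistoryBlob code) showing how one
serialized object could hold several texts keyed by hash, and why a
getText() without a hash argument needs some notion of a default:

    <?php
    // Toy illustration only. One object stores several texts, keyed by
    // content hash; getItem( $hash ) picks the right one after
    // unserialize(), as in the pre-r6138 code path.
    class ToyHistoryBlob {
        private $items = array();     // hash => text
        private $defaultHash = null;

        public function addItem( $text ) {
            $hash = md5( $text );
            $this->items[$hash] = $text;
            return $hash;             // callers keep this for later lookups
        }

        public function getItem( $hash ) {
            return isset( $this->items[$hash] ) ? $this->items[$hash] : false;
        }

        public function setDefaultHash( $hash ) {
            $this->defaultHash = $hash;
        }

        // Mirrors the post-r6138 call pattern: getText() takes no hash,
        // so the object itself must know which of its texts to return.
        public function getText() {
            return $this->getItem( $this->defaultHash );
        }
    }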
Good morning campers! :-)
At the POTY collection link http://dumps.wikimedia.org/other/poty/
you'll notice that the 2009 files have been added.
For people who have wanted older (2002 through 2006) dumps, one or two
dumps of the projects for most of that period are now online at our new
archive link: http://dumps.wikimedia.org/archive/
I'd love to get a couple of dumps of each of the projects for the years
2005, 2007 and 2008. If you have an old mirror on a hard drive
gathering dust in your closet, give me a shout. Thanks!
Hmm, and one other thing while I'm at it: I've been cleaning up our
dumps documentation on wikitech, and there's now a rough outline of the
dumps history here: http://wikitech.wikimedia.org/view/Dumps/History
If people remember any milestones that should be added there, or if you
see any glaring errors, please edit the page or send me mail with the
corrections.
Ariel