I don't know if this issue has come up already - in case it did and has
been dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results gathered a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible for
compression: each one can create archives the other can read. For
decompression, however, only pbzip2-compressed archives work reliably
with pbunzip2.
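To illustrate the compatibility claim from the reading side, here is a
minimal Python sketch (the filename is a placeholder): pbzip2 writes its
output as several independent bzip2 streams, and bz2.open() has handled
such concatenated streams since Python 3.3, so the same code reads
archives produced by either compressor.

    import bz2

    # pbzip2 emits one bzip2 stream per compression block; bz2.open()
    # transparently concatenates multi-stream files, so this reads both
    # bzip2- and pbzip2-compressed dumps alike.
    # "enwiki-pages-articles.xml.bz2" is a placeholder filename.
    with bz2.open("enwiki-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
        for i, line in enumerate(dump):
            print(line.rstrip())
            if i >= 9:  # just peek at the first ten lines
                break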
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run as usual for those users.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
machine.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
ironic?
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
It's amazing there are already so many years available for download.
Especially the larger zips must have been somewhat time-consuming to
compile! It would be great if the 2008 or pre-2006 packages became
available in the near future. It is a really interesting development to
see http://dumps.wikimedia.org interested in compiling not only large
current packaged databases of various wikis, but also more historic
content. In the past, the Internet Archive was the sole distributor of
older (also historic) wiki packages.
Eventually, in the far future, there will have to be some sort of viable
mechanism for cloning all the images stored on Wikimedia, though for now
the Picture of the Year packages are very interesting for those more
interested in the pretty images of Wikipedia. The POTY images also make
great wallpaper packs!
I'm doing a little bit of work on deployment procedures for the dump
scripts as I push out a few small bug fixes and turn on logging. Over
the next day or so you'll notice interruptions or delays while the
conversion is happening.
I have been looking at the Wikipedia database schema and I haven't
found any field suggesting that some content is geographically located.
Am I wrong?
If it is possible, I would like to download the geographically located
content of Wikipedia to do something similar to what Google Earth does
with the Wikipedia layer.
Is that possible?
Thanks in advance.
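A note on this: coordinates live in the article wikitext as {{coord}}
templates rather than in a dedicated schema field, so one way to extract
them is to scan a dump. A rough Python sketch, with a placeholder dump
filename (the regex only catches the simplest {{coord|lat|lon|...}} form):

    import bz2
    import re

    # Coordinates are embedded in article wikitext via the {{coord}}
    # template; there is no dedicated database field for them. This
    # rough scan pulls simple {{coord|...}} instances out of a
    # pages-articles dump. The regex ignores the template's many
    # degree/minute/second variants.
    COORD = re.compile(r"\{\{coord\|([^{}]*)\}\}", re.IGNORECASE)

    with bz2.open("enwiki-pages-articles.xml.bz2", "rt", encoding="utf-8") as f:
        for line in f:
            for match in COORD.finditer(line):
                print(match.group(1))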
I can't go, but some people on this list should think about a panel that
discusses forkability, archival of content and other related things. In
case this sounds attractive to someone who is planning to go, the
deadline for submission is in a week!
I'm willing to have my brain picked by anyone who decides this is worth
doing, in case that's helpful.
I would like to access the wiki dumps for Wikipedia and Wikitravel.
Essentially, I am looking to get dumps for some US cities from both these
sources for product research. However, it is not very clear from
http://dumps.wikimedia.org/backup-index.html which files are the relevant
ones to pick up.
The other question I had was at what level of granularity data is
available in the dumps. The web service allows us to retrieve a wiki
entry, but it is not easily parsed out into different sections or a more
granular form. I was wondering if the dumps solve this problem.
Appreciate any help with this.
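For reference, the dumps contain raw wikitext per page, so section-level
granularity has to be recovered by parsing. A minimal sketch using the
third-party mwparserfromhell library (the sample wikitext here is made up):

    import mwparserfromhell

    # Dumps ship raw wikitext, so splitting a page into sections is up
    # to the consumer. mwparserfromhell parses wikitext and can slice it.
    wikitext = "Lead paragraph.\n== History ==\nSome text.\n== Geography ==\nMore text.\n"
    code = mwparserfromhell.parse(wikitext)
    for section in code.get_sections(include_lead=True, include_headings=True):
        headings = section.filter_headings()
        title = str(headings[0].title).strip() if headings else "(lead)"
        print(title, "->", section.strip_code().strip())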