Xmldatadumps-l October 2015

xmldatadumps-l@lists.wikimedia.org

7 participants
5 discussions

by Richard Jelinek

Hi,

don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't...

I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the xml dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-)

Strange? Read on.

A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results obtained a few hours before that.

The results indicate that bzip2 and pbzip2 are mutually compatible: each one can create archives that the other one can read. But when it comes to uncompressing, only pbzip2-compressed archives work well with pbunzip2.

I propose compressing the archives with pbzip2 for the following reasons:

1) If your archiving machines are SMP systems, this could lead to better usage of system resources (i.e. faster compression).

2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run as usual for these people.

3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.

So to sum up: it's a no-lose and two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)

cheers,

--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Managing Director: Richard Jelinek
Human Language Technology Experts
Registered office: Fürth
69216618 Mind Units
Register court: AG Fürth, HRB-9201
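The compatibility and speedup claims above can be illustrated with a minimal sketch. It assumes, which the message does not state, that a parallel compressor writes its output as a series of independently compressed bzip2 streams concatenated into one file. It uses Python's standard bz2 module rather than pbzip2 itself, and the function names and the 900,000-byte chunk size are hypothetical choices for illustration.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_in_chunks(data, chunk_size=900_000):
    """Compress each chunk of `data` as its own independent bzip2 stream."""
    return [
        bz2.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def decompress_parallel(streams, workers=4):
    """Decompress independent bzip2 streams concurrently and rejoin them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(bz2.decompress, streams))


# Made-up sample data standing in for a dump file.
data = b"<page>example</page>\n" * 50_000
streams = compress_in_chunks(data)
archive = b"".join(streams)

# A regular single-threaded decompressor reads the concatenation as one file.
assert bz2.decompress(archive) == data

# Because the streams are independent, they can also be decoded concurrently.
assert decompress_parallel(streams) == data
```

Under that assumption, an archive written as one single stream offers no such independent pieces to hand out to several workers, which would be consistent with the speedup applying only to pbzip2-compressed files.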

8 years, 3 months

Wikimedia Dev Summit 2016 dumps 2.0 redesign session

by Ariel T. Glenn

Planning for the full redesign to kick off early 2016. When I say "full redesign" I mean it, so let's not talk small stuff about whether json or xml is a better format; let's talk about a design that will allow us to plug in various formats easily, for example.

Starting point for discussion here: https://phabricator.wikimedia.org/T114019 See the last link in the description.

What we can start on now: are these the right components to break the dumps into? Can we indeed cover all sorts of dumps this way? What existing software can we re-use? What sort of grid computing or other package can we use for the "black box" in the diagram? Etc. etc.

We want to get as much of this hashed out ahead of time as possible so that we can get the maximum out of the session.

Ariel

8 years, 5 months

Still no dumps available.

by Bryan White

Dumps are still not available on labs after two weeks. No one has said anything on why they are down or if they will be fixed. Dumps are still not being run twice a month. Bryan

8 years, 6 months

Non-English dump problems.

by Chris Newell

Hi,

Although the English language dump for October has completed successfully, there are problems with some other languages - the Spanish, German and Russian dumps dated 20151002 failed at: "Articles, templates, media/file descriptions, and primary meta-pages".

Could someone look into this?

Thanks,

Chris

-------------------
Chris Newell
Lead Technologist
Internet Research & Future Services
BBC Research & Development

8 years, 6 months

Incremental dump generation is stopped

by Ivan

Hello,

The daily dump generator did not create any dumps during the last two days. The directories

http://dumps.wikimedia.org/other/incr/wikidatawiki/20150930/
http://dumps.wikimedia.org/other/incr/wikidatawiki/20151001/

do not contain any dumps. Is this a known issue?

Sincerely,
Ivan

8 years, 6 months
