Hi,
I don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results produced a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible: each one can create archives
the other one can read. But when it comes to decompressing with
pbunzip2, only pbzip2-compressed archives work well.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e., faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should keep working as usual for them.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
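For illustration, here is a minimal sketch of that round trip in Python
(calling the command-line tools via subprocess). The input file name is
made up, and the exact flags (-p for processor count, -k to keep the
input, -t to test integrity) may vary slightly between pbzip2 versions:

import multiprocessing
import subprocess

dump = "pages-articles.xml"  # hypothetical input file
cores = multiprocessing.cpu_count()

# Parallel compression: keeps the original (-k) and writes dump + ".bz2".
subprocess.check_call(["pbzip2", "-k", "-p%d" % cores, dump])

# Integrity check with the stock serial tool (what bunzip2 users will do)...
subprocess.check_call(["bzip2", "-t", dump + ".bz2"])
# ...and with the parallel tool (pbunzip2 is the same binary in decompress mode).
subprocess.check_call(["pbzip2", "-t", "-p%d" % cores, dump + ".bz2"])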
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director (Geschäftsführer): Richard Jelinek
Registered office: Fürth
Commercial register: AG Fürth, HRB-9201
Hello Wikiteam!
Just in time to wish you a good vacation and a happy 2016 :)
Well, I am also here about corrupted files :)
I downloaded this file three times, from different wifi networks and
using Firefox download managers:
enwiki-20151201-pages-articles-multistream.xml.bz2
and this one twice:
enwiki-latest-pages-articles-multistream.xml.bz2
The MD5 checksum is correct (the *-latest-* file has the checksum of the
*-20151201-* one), but the file is corrupted.
I cannot use bzip2recover, because the file is so large that I would have
to recompile it to raise its compiled-in limit on the number of blocks it
can handle... and I think it is way better to get a fixed file :D
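In case it helps to reproduce this, here is roughly what I mean by
"checksum OK but file corrupted", as a small Python 3 sketch
(bz2.BZ2File reads multistream archives on Python 3.3+; the file name is
just the one above):

import bz2
import hashlib

name = "enwiki-20151201-pages-articles-multistream.xml.bz2"

# 1) Hash the compressed file as downloaded and compare against the
#    published MD5 checksum.
md5 = hashlib.md5()
with open(name, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print("md5:", md5.hexdigest())

# 2) Try to stream-decompress the whole file and report where it breaks.
done = 0
try:
    with bz2.BZ2File(name) as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            done += len(chunk)
except (OSError, EOFError) as exc:
    print("decompression failed after %d bytes: %s" % (done, exc))
else:
    print("decompressed %d bytes without error" % done)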
Could you please check if I am the only one having this issue?
Dumps of other languages have worked fine for me; en-* is problematic.
I see that almost all the dumps have "Dump complete" next to them and the
data has been transferred to labs. Problem is, the dumps are not
complete. Is this the new paradigm?... After each stage of the dump, label
them done and then transfer what files were generated? Wash, rinse and
repeat?
Bryan
I assume you have all seen https://phabricator.wikimedia.org/T116907
"Explore the possibility of splitting dewiki and frwiki into smaller
chunks"
If not, and you ever use frwiki or dewiki page content dumps, go read
it now. Or if you know of anyone who uses them, please nag them to go
read it.
The upshot is that, most likely starting January 1st 2016, we will do all
further dump runs of frwiki and dewiki with so-called 'checkpointing'.
This change is being made so that if one of these jobs is interrupted
for whatever reason, it can be rerun with only the missing page ranges
dumped on the second run, saving quite a lot of time. A second reason
is to ease the burden on downloaders, who generally prefer downloading
several smaller files rather than one large 90 GB file (example taken
from the dewiki history dumps).
What does this mean in practice for you, users of the dumps? It means
that filenames for the page content (articles, meta-current and
meta-history) dumps will have pXXXXpYYYY in the names, where XXXX is the
first page id in the file and YYYY is the last page id in the file. For
examples of this you can look at the enwiki page content dumps, which
have been running that way for a few years now.
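If it helps with the conversion, here is a rough sketch of how a
download script might pull the page range out of the new names; the full
example file names below are invented, only the pXXXXpYYYY part is what
matters:

import re

PAGE_RANGE = re.compile(r"p(\d+)p(\d+)")

def page_range(filename):
    """Return (first page id, last page id), or None for unsplit files."""
    match = PAGE_RANGE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Checkpointed file: yields (1, 12345).
print(page_range("dewiki-20160101-pages-meta-history1.xml-p000000001p000012345.bz2"))
# Old-style unsplit file: yields None.
print(page_range("dewiki-20151201-pages-articles.xml.bz2"))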
This notice should give you plenty of time to convert your tools to use
the new naming scheme. I encourage you to forward this message to
other appropriate people or groups.
Thanks,
Ariel
Also if you are a dumps user or have thoughts about how you would redo
them from scratch, get your ideas in now. We're not waiting for the
Dev Summit to get the work started. See
https://phabricator.wikimedia.org/T114019 for details, especially the
document linked at the end of the task description under 'FOR CURRENT
DISCUSSION'.
We need: comments on the strawman model proposed in the document
mentioned above; proposals for code we can reuse for any of those
pieces, especially the job queue/management piece. (Celery? Something
else?) What should the object store be based on, if we have one? Is
Ceph a dead end? Will Swift be deadly slow or should we use it since we
have it in house already? Etc. All comments on the ticket please so we
have them all in one place.
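To make the job queue question a little more concrete, here is a purely
illustrative sketch of what one page-range job might look like if we
went with Celery; the broker URL, task name and arguments are all
invented for the example and imply nothing about the eventual design:

from celery import Celery

app = Celery("dumps", broker="redis://localhost:6379/0")

def run_page_range_dump(wiki, first_page_id, last_page_id):
    # Placeholder for the real dump machinery.
    pass

@app.task(bind=True, max_retries=3)
def dump_page_range(self, wiki, first_page_id, last_page_id):
    """One checkpointed page-range dump as an independent, retryable job."""
    try:
        run_page_range_dump(wiki, first_page_id, last_page_id)
    except Exception as exc:
        # Retry the failed range only, instead of rerunning the whole wiki.
        raise self.retry(exc=exc, countdown=60)

# A scheduler would then queue e.g. dump_page_range.delay("dewiki", 1, 12345).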
Expect that most of the code written here at WMF will be in Python
unless someone else volunteers to write some. Anyone interested in
doing some of the development can step right up too of course.
Please forward this on to other fora as appropriate.
Ariel