Hi,
I don't know if this issue has come up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results done a few hours before that.
The results indicate the following:
bzip2 and pbzip2 are mutually compatible - each one can create archives
the other can read. When it comes to decompressing, however, only
pbzip2-compressed archives are handled properly by pbunzip2.
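For reference, here is a minimal sketch of the kind of round-trip check
behind those results (not the exact test script; the sample file name is
just a placeholder):

    import subprocess
    from pathlib import Path

    SAMPLE = Path("sample.xml")  # placeholder test file

    def roundtrip(compress_cmd, decompress_cmd):
        """Compress SAMPLE with one tool, decompress the result with the
        other, and return True if the original bytes survive."""
        subprocess.run(compress_cmd + ["-k", "-f", str(SAMPLE)],
                       check=True)                       # writes sample.xml.bz2
        out = subprocess.run(decompress_cmd + ["-c", str(SAMPLE) + ".bz2"],
                             check=True, capture_output=True).stdout
        return out == SAMPLE.read_bytes()

    # cross-check both directions
    print("bzip2  -> pbzip2 -d:", roundtrip(["bzip2"], ["pbzip2", "-d"]))
    print("pbzip2 -> bunzip2  :", roundtrip(["pbzip2"], ["bunzip2"]))

Both directions use only the -k/-f/-c/-d flags that the two tools share.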
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to better
usage of system resources, i.e. faster compression (see the sketch after
this list).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should keep working for these people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
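As a concrete illustration of point 1: pbzip2 shares bzip2's basic
command-line interface, so swapping it into an existing dump pipeline
should be close to a one-word change. A minimal sketch, assuming the dump
is compressed from a plain XML file (path and thread count are
placeholders):

    import subprocess

    DUMP_XML = "somewiki-pages-articles.xml"   # placeholder path
    THREADS = 8                                # placeholder CPU count

    # pbzip2 accepts the same basic flags as bzip2; -p# sets the number of
    # worker threads. The resulting .bz2 is readable by bunzip2 and pbunzip2.
    subprocess.run(["pbzip2", "-f", "-p" + str(THREADS), DUMP_XML],
                   check=True)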
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com          Managing director: Richard Jelinek
Human Language Technology Experts       Registered office: Fürth
69216618 Mind Units                     Commercial register: AG Fürth, HRB-9201
Hello everybody,
This is my first thread on the list; I'm relatively new to
Wikipedia/MediaWiki usage (1 month).
I followed the documentation (RTFM as usual), and every step seems to end
at a wall.
My simple question: how do I correctly install a Wikipedia mirror from the
dumps in MediaWiki?
My goal:
Create an offline MediaWiki server from the FR, EN or PT dumps (one
language only).
I did some tests with wowiki, which is small enough for testing purposes,
but I encountered problems:
Most of the tools mentioned in the documentation are 404 or outdated, for
example xml2sql or mwdumper.
The original MediaWiki importer takes 20 minutes to import the 1.6 MB
wowiki dump, which means approximately a decade for the FR one...
I tried the latest MWDumper found on GitHub; it can quickly generate a
usable SQL file, but it seems that special characters in the page text are
not escaped ("slashed"), so sequences like \n cause SQL errors...
Some columns are missing from the schema created by the original MediaWiki
installer, for example "page_counter" in the "page" table. Are there
extensions that need to be installed to import the dumps?
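From what I can tell, page_counter existed in older MediaWiki schemas but
was later dropped from core, while the import tools still emit it. If that
is indeed the mismatch, I suppose re-adding the column before loading the
SQL would work - a rough sketch (connection details are placeholders, and
the column definition is my best guess at the old one):

    import pymysql

    db = pymysql.connect(host="localhost", user="wiki", password="wiki",
                         database="wikimirror")
    cur = db.cursor()
    cur.execute("SHOW COLUMNS FROM page LIKE 'page_counter'")
    if cur.fetchone() is None:
        # legacy column, re-added only so the generated INSERTs load
        cur.execute("ALTER TABLE page "
                    "ADD COLUMN page_counter BIGINT UNSIGNED NOT NULL DEFAULT 0")
    db.commit()
    db.close()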
How can I find/get the media (low-resolution images are enough)?
How should interwiki media/links/pages be handled?
The next step after that will be the update procedure, incremental or
complete...
It seems the French Wikimedia team is stuck on my questions...
I'm not opposed to doing development (Python, PHP or JEE), but I need an
entry point to start from if nobody is working on it...
Thank you for your work. I know you are quite busy with dump generation,
but our project is quite serious, lots of people want it, and we need
quick answers on its feasibility.
Best regards,
Yoni