Hi,
I don't know whether this issue has come up before - in case it did and
was dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives the other one can read. But when it comes
to decompressing, only pbzip2-compressed archives work well with
pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
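For what it's worth, a minimal sketch of what that workflow could look
like (the file name and thread count are just placeholders):

  # compress using 8 worker threads
  pbzip2 -p8 enwiki-pages-meta-history.xml
  # the result is still a valid .bz2, so plain bunzip2 can verify it
  bunzip2 -tv enwiki-pages-meta-history.xml.bz2
  # and pbzip2 -d (pbunzip2) decompresses the same archive in parallel
  pbzip2 -d -p8 enwiki-pages-meta-history.xml.bz2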
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
Ack, sorry for the (no subject); again in the right thread:
> For external uses like XML dumps integrating the compression
> strategy into LZMA would however be very attractive. This would also
> benefit other users of LZMA compression like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of
current 7zip (or histzip), and goes at 30MB/s on my box,
which is still 8x faster than the status quo (going by a 1GB
benchmark).
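For concreteness, a hedged sketch of those invocations (file names are
placeholders):

  # xz preset 3 (~4 MB dictionary), streaming to stdout
  xz -3 -c pages-meta-history.xml > pages-meta-history.xml.xz
  # or roughly the same settings via 7-Zip
  7za a -mx=3 pages-meta-history.xml.7z pages-meta-history.xml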
Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall
On Tue, Jan 21, 2014 at 2:19 PM, Randall Farmer <randall(a)wawd.com> wrote:
> > For external uses like XML dumps integrating the compression
> > strategy into LZMA would however be very attractive. This would also
> > benefit other users of LZMA compression like HBase.
>
> For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
>
> That has a 4 MB buffer, compression ratios within 15-25% of
> current 7zip (or histzip), and goes at 30MB/s on my box,
> which is still 8x faster than the status quo (going by a 1GB
> benchmark).
>
> Re: trying to get long-range matching into LZMA, first, I
> couldn't confidently hack on liblzma. Second, Igor might
> not want to do anything as niche-specific as this (but who
> knows!). Third, even with a faster matching strategy, the
> LZMA *format* seems to require some intricate stuff (range
> coding) that may be a blocker to getting the ideal speeds
> (honestly not sure).
>
> In any case, I left a note on the 7-Zip boards as folks have
> suggested: https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
>
> Thanks for the reply,
> Randall
>
>
Hi, everyone.
tl;dr: New tool compresses full-history XML at 100MB/s, not 4MB/s, with the
same avg compression ratio as 7zip. Can anyone help me test more or
experimentally deploy?
As I understand, compressing full-history dumps for English Wikipedia and
other big wikis takes a lot of resources: enwiki history is about 10TB
unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's
over a day of server time. There's been talk about ways to speed that up in
the past.[1]
It turns out that for history dumps in particular, you can compress many
times faster if you do a first pass that just trims the long chunks of text
that didn't change between revisions. A program called rzip[2] does this
(and rzip's _very_ cool, but fatally for us it can't stream input or
output). The general approach is sometimes called Bentley-McIlroy
compression.[3]
So I wrote something I'm calling histzip.[4] It compresses long repeated
sections using a history buffer of a few MB. If you pipe history XML
through histzip to bzip2, the whole process can go ~100 MB/s/core, so we're
talking an hour or three to pack enwiki on a big box. While it compresses,
it also self-tests by unpacking its output and comparing checksums against
the original. I've done a couple test runs on last month's fullhist dumps
without checksum errors or crashes. Last full run I did, the whole dump
compressed to about 1% smaller than 7zip's output; the exact ratios varied
file to file (I think it's relatively better at pages with many revisions)
but were +/- 10% of 7zip's in general.
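Concretely, the first pass is just a filter in a pipe; something like
the following (the file name is a placeholder; see the histzip README
for the exact invocation, including decompression):

  # long-range pass with histzip, then ordinary bzip2 for the final squeeze
  histzip < pages-meta-history.xml | bzip2 > pages-meta-history.xml.hz.bz2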
Also, less exciting, but histzip is also a reasonably cheap way to make the
daily incremental dumps about 30% smaller.
Technical data dump aside: *How could I get this more thoroughly tested,
then maybe added to the dump process, perhaps with an eye to eventually
replacing 7zip as the alternate, non-bzip2 compressor?* Who do I talk
to to get started? (I'd dealt with Ariel Glenn before, but haven't seen
activity from Ariel lately, and in any case maybe playing with a new tool
falls under Labs or some other heading than dumps devops.) Am I nuts to be
even asking about this? Are there things that would definitely need to
change for integration to be possible? Basically, I'm trying to get this
from a tech demo to something with real-world utility.
Best,
Randall
[1] Some past discussion/experiments are captured at
http://www.mediawiki.org/wiki/Dbzip2, and some old scripts I wrote are at
https://git.wikimedia.org/commit/operations%2Fdumps/11e9b23b4bc76bf3d89e1fb…
[2] http://rzip.samba.org/
[3]
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep1&t…
[4] https://github.com/twotwotwo/histzip
Hi,
I am trying to set up a local Wikipedia mirror. I have been reading up on how to import XML dumps and install extensions manually, but I find it hard to identify and install all of the required extensions properly. I have been testing this with the Simple Wikipedia.
What is the best and easiest way to install a local mirror?
Best,
Bastian
Dear Ariel,
Happy New Year. I am gearing up for wp-mirror-0.7. To that end, I would
like to list some issues that I see; and I would like to offer my help in
solving them.
0) Problem Statements
0.1) Page Rendering. Wp-mirror-0.6 works well in the sense that it builds
a faithful mirror of any of your wikis. However, during 2013 the rendering
of pages eroded materially. For example,
o interlanguage links have vanished both from rendered pages and from
dump files;
o infoboxes are no longer rendered;
o most transclusions now render as redlinks even though the templates
are easily found in the underlying database; etc.
I understand that this erosion occurred because wp-mirror-0.6 still uses
mediawiki-1.19, whereas WMF has moved on to mediawiki-1.23. For example, I
understand that:
o interlanguage links have been moved to the Wikidata project, the
rendering of which requires mediawiki-1.21+;
o infoboxes now require the scribunto extension which requires
mediawiki-1.20+
0.2) Database Schema. Some differences in database schema have appeared.
o category - dump files now have 5 fields, whereas the database schema
has 6 fields;
o externallinks - dump files now have 4 fields, whereas the database
schema has 3 fields.
Loading either of these two tables generates the error message: ``Column
count doesn't match value count at row 1.''
0.3) Version Lifecycle. According to <
http://www.mediawiki.org/wiki/Version_lifecycle> mediawiki 1.23 LTS is
slated for May 2014. However, the Debian packaging team is silent as to
their plans for a transition from mediawiki-1.19 LTS to mediawiki-1.23 LTS.
0.4) Image Dumps. The large image dump tarballs are now a year old. This
means that, while wp-mirror still downloads the bulk of its images from
these tarballs, there are a growing number that must be downloaded
individually from WMF.
0.5) Thumbs. One person has asked me if dump files of thumbs could be made
available. We are beginning to see thumb dumps from the xowa project.
0.6) IPv6. I am glad to see that <gerrit.wikimedia.org> has an IPv6
address. However, <bastion.wmflabs.org> still does not. My internal
network is IPv6 only.
1) mwxml2sql
This utility from Ariel Glenn has proved invaluable to the wp-mirror
project. Together with MySQL 5.5 fast index creation, it allows
wp-mirror to build mirrors much faster than before (80% less time).
1.1) Need for update. According to its version information, mwxml2sql may
only be valid through mediawiki-1.21.
(shell)$ mwxml2sql --version
mwxml2sql 0.0.2
Supported input schema versions: 0.4 through 0.8.
Supported output MediaWiki versions: 1.5 through 1.21.
Since I am looking forward to mediawiki-1.23 LTS (see below), I would
like to know whether mwxml2sql should be updated.
1.2) Help Offer. If mwxml2sql does need updating, I would be happy to help
with this; and to package it for Debian as I have done before. Perhaps we
could call it mwxml2sql-0.0.3.
2) mediawiki-1.23 LTS.
2.1) Vision. I would like wp-mirror-0.7 to be able to build a mirror that
serves pages that look no different than those served by WMF.
2.2) DEB package. To that end, I am thinking of packaging mediawiki-1.23
together with the extensions needed for rendering WMF wikis with wikidata
content, infoboxes, math, transclusions, etc. Given WMF's ``continuous
integration'' development model, I would like to be able to automatically
generate a tarball and DEB package each time WMF pushes an update to its
servers.
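Roughly what I have in mind is a script along these lines (purely a
sketch on my part; the clone URLs, branch name, and extension list are
assumptions that I still need to verify against gerrit):

  # fetch MediaWiki core plus one example extension needed for rendering
  git clone https://gerrit.wikimedia.org/r/mediawiki/core.git mediawiki-1.23
  (cd mediawiki-1.23 && git checkout REL1_23)   # branch name is an assumption
  git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Scribunto.git \
      mediawiki-1.23/extensions/Scribunto
  # roll a snapshot tarball that the DEB packaging can start from
  tar czf mediawiki-1.23-snapshot.tar.gz mediawiki-1.23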
2.3) Debian package repository. Such a DEB package would be distributed
with wp-mirror. In preparation for this, I have set up a Debian package
repository at <http://download.savannah.gnu.org/releases/wp-mirror/>. It
is currently used to distribute wp-mirror-0.6 and an unstable version of
wp-mirror-0.7. Home page <http://www.nongnu.org/wp-mirror/>.
2.4) Help Offer. I am happy to do most of this work myself. However, I
will need some guidance on interacting with the appropriate GIT
repositories. I hope that you can put me in touch with someone involved in
the ``continuous integration'' process.
3) Media dumps
I am thinking that updating the image dumps annually would be adequate.
Including thumbs in those dumps would materially assist the off-line
community. I could easily update wp-mirror-0.7 to give the user a choice
(no media files, thumbs only, full size media files).
Sincerely Yours,
Kent