Hi,
don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the xml dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible: each one can create archives the other can read. But when it comes to uncompressing, only pbzip2-compressed archives are handled well by pbunzip2.
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.
So to sum up: It's a no-lose, two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)
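For illustration - assuming standard pbzip2 options and an invented file name - the change would boil down to something like:

# compress on 8 cores instead of one; the output is still an ordinary .bz2
pbzip2 -p8 -9 -k dewiki-pages-articles.xml
# regular users keep using bunzip2 exactly as before
bunzip2 -k dewiki-pages-articles.xml.bz2
# users who have pbzip2 installed get parallel decompression
pbzip2 -d -p8 -k dewiki-pages-articles.xml.bz2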
cheers,
Richard Jelinek, 28/01/2012 00:38:
don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't...
There's a quite old comparison here: https://www.mediawiki.org/wiki/Dbzip2 but https://wikitech.wikimedia.org/view/Dumps/Parallelization suggests it's still relevant.
Nemo
On 28-01-2012, Sat, at 08:34 +0100, Federico Leva (Nemo) wrote:
Richard Jelinek, 28/01/2012 00:38:
don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't...
There's a quite old comparison here: https://www.mediawiki.org/wiki/Dbzip2 but https://wikitech.wikimedia.org/view/Dumps/Parallelization suggests it's still relevant.
I need to revisit this issue and look at it carefully at some point. I will say for now that by running multiple workers on one host, we already make use of multiple cpus for the small and medium wikis. For en wikipedia, we produce multiple pieces at once for each phase that matters; this includes production of the gzipped stub files, so once again for that case we are making the maximum use of our cpus and memory on the host that runs those.
It's possible that we could recombine the enwiki dumps into a single file by using pbzip2; that recombination is where compression on a single cpu would slow us down, but right now we just skip that step. People have been fine with using the smaller files and in fact seem to prefer them.
The other thing about switching from one bzip2 implementation to another is that I rely on some specific properties of the bzip2 output (and its library) for integrity checking and for locating blocks in the middle of a dump when needed. I'd need to make sure my hacks still worked with the new output.
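(For context on those properties: every bzip2 block starts with the 48-bit magic 0x314159265359, and that is also how bzip2recover, which ships with bzip2, finds block boundaries. A rough illustration of the idea, not the actual hacks used for the dumps, with an invented file name:

# write each block of the archive out as its own small, independently decompressible .bz2
bzip2recover dewiki-pages-articles.xml.bz2
bunzip2 -c rec*dewiki-pages-articles.xml.bz2 | head
)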
Ariel
On Sat, Jan 28, 2012 at 09:56:13AM +0200, Ariel T. Glenn wrote:
The other thing about switching from one bzip2 implementation to another is that I rely on some specific properties of the bzip2 output (and its library) for integrity checking and for locating blocks in the middle of a dump when needed. I'd need to make sure my hacks still worked with the new output.
Sure. Integrity and compatibility should have topmost priority. I didn't take libbzip2 compatibility into account at first. Maybe a viable way for us would be some reliable way to detect (any idea?) what was used to compress an archive, and then use bunzip2 or pbunzip2 accordingly.
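One crude idea, assuming GNU grep with PCRE support (-P) and an invented file name: pbzip2 writes one independent bzip2 stream per compressed block, and each stream starts byte-aligned with "BZh" plus the level digit, followed directly by the block magic 0x314159265359, whereas a classic single-stream bzip2 file has exactly one such header, at offset 0. So counting those headers could serve as a first approximation (the byte pattern can in principle also occur by accident inside compressed data):

# count byte-aligned bzip2 stream headers; more than 1 suggests pbzip2-style multi-stream output
LC_ALL=C grep -aoP 'BZh[1-9]\x31\x41\x59\x26\x53\x59' dump.xml.bz2 | wc -l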
Parallel unpacking on the right format gives us a 4-5x speedup, whereas on the regular bzip2 archive there is no speedup, but ~6x CPU waste.
I had a look at the links you sent with the compressor comparison. It seems that dbzip2 development has been more or less dormant since 2008, while pbzip2 seems to be actively developed/maintained - maybe the pbzip2 devs would have an open ear for our wishlist? ;-)
regards,
On 28-01-2012, Sat, at 09:51 +0100, Richard Jelinek wrote:
Parallel unpacking on the right format gives us a 4-5x speedup, whereas on the regular bzip2 archive there is no speedup, but ~6x CPU waste.
Sure, but that assumes you are not already using those other cores for something else. In our case, we are ;-)
Ariel
On Sat, Jan 28, 2012 at 10:56:27AM +0200, Ariel T. Glenn wrote:
Sure, but that assumes you are not already using those other cores for something else. In our case, we are ;-)
So are we (the Perl scripts use e.g. http://search.cpan.org/~dlux/Parallel-ForkManager/, so we have several bunzip2 processes running). Despite SSDs, RAID arrays etc., the machines still have an abundance of CPU power, so naturally I would like to throw CPU time at some problems instead of seeing 2-4% CPU usage and disk waits. :-)
pbunzip2 allows me to do that under certain circumstances (uncompressing a pbzip2-packed archive). I can read the whole archive into memory, decompress it in parallel, and then write it out to the disks in bulk.
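For comparison, the plain-bunzip2 side of this - several archives at once via Parallel::ForkManager - amounts to something you could just as well sketch with xargs (one bunzip2 process per file, parallelism across files rather than within one file):

# decompress several wiki archives at once, four at a time, keeping the .bz2 files
printf '%s\0' *.xml.bz2 | xargs -0 -P 4 -n 1 bunzip2 -k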
I have no problem continuing with the status quo (several processes on an SMP machine), but I still see a CPU load of just 50% on average. Despite hyperthreading, of the 16 cores (8 physical + 8 virtual), only ~4 physical CPUs are under load. I guess the machine will get some VMs to host then. :-)
The situation changes if we repack the archives with pbzip2: during unpacking we then see a load of 6 to 6.5 physical CPUs per machine and a 4x speedup. But repacking is pointless in everyday use, as we inspect (unpack) every wiki archive just once; we did the repacking only for evaluation purposes.
OK, keep the status quo as is, but my plea is to think, in the future, about the growing number of CPU cores in the machines of your dumps' users. ;-)
regards,
On 28/01/12 00:38, Richard Jelinek wrote:
So to sum up: It's a no-lose, two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)
Note that pbzip2 files are usually larger. And with our dump sizes, a small percentage increase could be "a lot". :)
On Sat, Jan 28, 2012 at 11:35:56PM +0100, Platonides wrote:
On 28/01/12 00:38, Richard Jelinek wrote:
So to sum up: It's a no-lose, two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)
Note that pbzip2 files are usually larger. And with our dump sizes, a small percentage increase could be "a lot". :)
Strange, I thought I was the pedantic one. ;-)
For the German Wikipedia archive as of 2012-01-16 it's
2556353564 (bzip2) vs. 2557360903 (pbzip2)
That seems to be 0.039% more; I would call that a small percentage.
regards,
The big advantage with pbzip2 (if it's the program I've been looking at) is the unzip speed. I can't keep everything unzipped, and some utilities can't read bz2 files. Every time I unzip pages-articles I'm in for a long wait - before I get an error...
LZMA also offers much improved unzip speed at the cost of a little more zipping time. Since many unzips occur for every single zip, this seems like a good deal.
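For example, assuming the xz implementation of LZMA (file name invented; -T0 needs xz >= 5.2):

# compression takes longer, decompression is fast; -T0 uses all cores while compressing
time xz -k -T0 pages-articles.xml
time xz -dk pages-articles.xml.xz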
On Fri, Jan 15, 2016 at 11:12:36PM +0000, Richard Farmbrough wrote:
The big advantage with pbzip2 (if it's the program I've been looking at) is the unzip speed. I can't keep everything unzipped, and some utilities can't read bz2 files. Every time I unzip pages-articles I'm in for a long wait - before I get an error...
The problem with pbzip2 is its poor proliferation out there. It's not part of default installs, and some distributions, like the notorious Debian, keep shipping old and buggy versions (1.1.8 in wheezy, 1.1.9 in jessie).
LZMA also offers much improved unzip speed at the cost of a little more zipping time. Since many unzips occur for every single zip, this seems like a good deal.
The world has moved on since my proposal in 2012 and maybe a look at https://github.com/Cyan4973/lz4 would be in order.
regards,
On Sat, Jan 16, 2016 at 12:17:20PM +0100, Richard Jelinek wrote:
The world has moved on since my proposal in 2012 and maybe a look at https://github.com/Cyan4973/lz4 would be in order.
The compression ratio is bad:
# lz4 -9 ces-20160111.xml ces-20160111.xml.lz4
Compressed 2343742714 bytes into 818397719 bytes ==> 34.92%
compared to 557039914 bytes of bzip2:
-rw-r--r-- 1 root root 557039914 Jan 13 08:48 ces-20160111.xml.bz2
-rw-r--r-- 1 root root 818397719 Jan 16 12:31 ces-20160111.xml.lz4
However, decompression gives a speedup of roughly 25x (on our machine):
# time bunzip2 -k ces-20160111.xml.bz2
real 1m47.853s
user 1m40.992s
sys 0m3.692s
# time lz4 -d ces-20160111.xml.lz4 ces-20160111.xml2
Successfully decoded 2343742714 bytes
real 0m4.416s
user 0m2.079s
sys 0m2.340s
So if you need decompression speed and can handle the 60% larger archive size, you may want to use that. Although I doubt that is the default requirement.
regards,
On Sat, Jan 16, 2016 at 9:48 AM, Richard Jelinek rj@petamem.com wrote:
So if you need decompression speed and can handle the 60% larger archive size, you may want to use that. Although I doubt that is the default requirement.
This is only true for cases where multiple decompressions are required; I find it much simpler to just keep the decompressed dumps around.
(Just saying that this is not - at least for me or my coworkers - the default requirement.)