Ack, sorry for the (no subject); again in the right thread:
For external uses like XML dumps, integrating the compression strategy into LZMA would, however, be very attractive. This would also benefit other users of LZMA compression, like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That uses a 4 MB buffer, gets compression ratios within 15-25% of current 7zip (or histzip), and runs at 30 MB/s on my box, which is still 8x faster than the status quo (going by a 1 GB benchmark).
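If it helps, here's a rough Python sketch of roughly those settings via the standard lzma module (which wraps liblzma, the xz library); the input filename is just a placeholder, not a real dump path:

import lzma
import shutil

# Stream-compress a file with LZMA preset 3 (the same level as "xz -3",
# which uses a 4 MiB dictionary). IN_PATH is a hypothetical example file.
IN_PATH = "pages-meta-history.xml"
OUT_PATH = IN_PATH + ".xz"

with open(IN_PATH, "rb") as src, lzma.open(OUT_PATH, "wb", preset=3) as dst:
    shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy in 1 MiB chunks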
Trying to get quick-and-dirty long-range matching into LZMA isn't feasible for me personally, and there may be inherent technical difficulties. Still, I left a note on the 7-Zip boards as folks suggested; feel free to add anything there: https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall
If you're going to use xz then you wouldn't even have to recompress the blocks that haven't changed and are already well compressed.
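Something like this sketch (the chunk boundaries and cache are placeholder assumptions, not how the dump tooling actually works): compress each chunk as its own .xz stream and concatenate them; concatenated .xz streams decompress as a single file, so only changed chunks would ever need recompressing.

import hashlib
import lzma

def build_dump(chunks, cache):
    """Concatenate per-chunk .xz streams, reusing cached output for unchanged chunks.

    'chunks' is an iterable of byte strings (e.g. one block of revisions each);
    'cache' maps a chunk digest to its previously compressed bytes.
    Concatenated .xz streams decompress as one file with plain "xz -d".
    """
    out = []
    for chunk in chunks:
        key = hashlib.sha256(chunk).hexdigest()
        if key not in cache:
            cache[key] = lzma.compress(chunk, preset=3)  # only new/changed chunks are compressed
        out.append(cache[key])
    return b"".join(out)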
One of the things I can't understand is why we are extracting summaries of pages for Yahoo. Is it our job to do it? The dumps are really huge, e.g. for Wikidata the abstract dump (http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-abstract.xml) is 14.1 GB. Compare it to the full history dump (http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-pages-meta-history.xml.bz2) at 8.8 GB.
So why are we doing this?
Best
On Wed, Jan 22, 2014 at 4:10 AM, Anthony ok@theendput.com wrote:
If you're going to use xz then you wouldn't even have to recompress the blocks that haven't changed and are already well compressed.
On 01/21/2014 09:47 PM, Amir Ladsgroup wrote:
One of the things I can't understand is why we are extracting summaries of pages for Yahoo. Is it our job to do it? The dumps are really huge, e.g. for Wikidata the abstract dump wikidatawiki-20140106-abstract.xml is 14.1 GB, compared to the full history wikidatawiki-20140106-pages-meta-history.xml.bz2 at 8.8 GB.
That's because the Yahoo one isn't compressed.
I'm not sure if Yahoo still uses those abstracts, but I wouldn't be surprised at all if other people are.
Matt Flaschen
On Wed, Jan 22, 2014 at 10:31 AM, Matthew Flaschen mflaschen@wikimedia.org wrote:
That's because the Yahoo one isn't compressed.
Why? Can we make it compressed? It's really annoying to see that huge file there for almost no reason.
On 01/21/2014 11:10 PM, Amir Ladsgroup wrote:
Why? Can we make it compressed? It's really annoying to see that huge file there for almost no reason.
It's probably because it's relatively small on major wikis (e.g. English Wikipedia's is 3.8 GB). However, I see no reason not to compress it, especially when it's larger (like Wikidata's).
Matt Flaschen