One of the things I can't understand is why we are extracting summaries of pages for Yahoo. Is it our job to do it? The dumps are really huge, e.g. for wikidata (http://dumps.wikimedia.org/wikidatawiki/20140106/):
abstract: wikidatawiki-20140106-abstract.xml (http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-abstract.xml), 14.1 GB
Compare it to the full history: wikidatawiki-20140106-pages-meta-history.xml.bz2 (http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-pages-meta-history.xml.bz2), 8.8 GB
So why are we doing this?
Best
On Wed, Jan 22, 2014 at 4:10 AM, Anthony ok@theendput.com wrote:
If you're going to use xz then you wouldn't even have to recompress the blocks that haven't changed and are already well compressed.
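To make the "don't recompress unchanged blocks" idea concrete, here is a minimal sketch in Python. It assumes fixed-size chunking and a content-hash cache, neither of which is part of xz or any existing dump tool; each chunk is written as a self-contained .xz stream, and xz decompresses a concatenation of such streams as a single file. A real implementation would more likely align chunks with page/revision boundaries or use content-defined chunking, since a single insertion shifts every fixed-size boundary after it.

    import hashlib
    import lzma

    CHUNK_SIZE = 4 * 1024 * 1024  # illustrative 4 MiB chunks

    def compress_dump(in_path, out_path, cache):
        """cache maps a hash of a raw chunk to its compressed bytes from an earlier run."""
        with open(in_path, "rb") as src, open(out_path, "wb") as dst:
            while True:
                chunk = src.read(CHUNK_SIZE)
                if not chunk:
                    break
                key = hashlib.sha1(chunk).hexdigest()
                if key not in cache:
                    # Only chunks not seen before get compressed; each one is an
                    # independent .xz stream, so unchanged chunks are reused as-is.
                    cache[key] = lzma.compress(chunk, preset=3)
                dst.write(cache[key])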
On Tue, Jan 21, 2014 at 5:26 PM, Randall Farmer randall@wawd.com wrote:
Ack, sorry for the (no subject); again in the right thread:
For external uses like XML dumps, integrating the compression strategy into LZMA would, however, be very attractive. This would also benefit other users of LZMA compression, such as HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of current 7zip (or histzip), and goes at 30MB/s on my box, which is still 8x faster than the status quo (going by a 1GB benchmark).
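For reference, a rough equivalent of the xz -3 setting above can be sketched with Python's lzma module, whose presets mirror xz's -0..-9 levels (preset 3 uses a 4 MiB dictionary, the "4 MB buffer" mentioned). The file names below are placeholders, not actual dump names.

    import lzma
    import shutil

    # Stream a dump through an xz preset-3 compressor, 1 MiB at a time.
    with open("pages-meta-history.xml", "rb") as src, \
            lzma.open("pages-meta-history.xml.xz", "wb", preset=3) as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)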
Trying to get quick-and-dirty long-range matching into LZMA isn't feasible for me personally and there may be inherent technical difficulties. Still, I left a note on the 7-Zip boards as folks suggested; feel free to add anything there: https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall