One of the things I can't understand is why we are extracting summaries of pages for Yahoo. Is it our job to do it? The dumps are really huge, e.g. for wikidata (http://dumps.wikimedia.org/wikidatawiki/20140106/):
abstract: wikidatawiki-20140106-abstract.xml (http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-abstract.xml), 14.1 GB
Compare it to the full history: wikidatawiki-20140106-pages-meta-history.xml.bz2 (http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-pages-meta-history.xml.bz2), 8.8 GB
So why are we doing this?
Best
On Wed, Jan 22, 2014 at 4:10 AM, Anthony ok@theendput.com wrote:
If you're going to use xz then you wouldn't even have to recompress the blocks that haven't changed and are already well compressed.
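To make the "don't recompress unchanged blocks" idea concrete, here is a minimal sketch in Python. It assumes fixed-size chunking and a content-hash cache, neither of which is part of xz or any existing dump tool; each chunk is written as a self-contained .xz stream, and xz decompresses a concatenation of such streams as a single file. A real implementation would more likely align chunks with page/revision boundaries or use content-defined chunking, since a single insertion shifts every fixed-size boundary after it.

    import hashlib
    import lzma

    CHUNK_SIZE = 4 * 1024 * 1024  # illustrative 4 MiB chunks

    def compress_dump(in_path, out_path, cache):
        """cache maps a hash of a raw chunk to its compressed bytes from an earlier run."""
        with open(in_path, "rb") as src, open(out_path, "wb") as dst:
            while True:
                chunk = src.read(CHUNK_SIZE)
                if not chunk:
                    break
                key = hashlib.sha1(chunk).hexdigest()
                if key not in cache:
                    # Only chunks not seen before get compressed; each one is an
                    # independent .xz stream, so unchanged chunks are reused as-is.
                    cache[key] = lzma.compress(chunk, preset=3)
                dst.write(cache[key])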
On Tue, Jan 21, 2014 at 5:26 PM, Randall Farmer randall@wawd.com wrote:
Ack, sorry for the (no subject); again in the right thread:
For external uses like XML dumps, integrating the compression strategy into LZMA would, however, be very attractive. This would also benefit other users of LZMA compression, such as HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of current 7zip (or histzip), and goes at 30MB/s on my box, which is still 8x faster than the status quo (going by a 1GB benchmark).
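For reference, a rough equivalent of the xz -3 setting above can be sketched with Python's lzma module, whose presets mirror xz's -0..-9 levels (preset 3 uses a 4 MiB dictionary, the "4 MB buffer" mentioned). The file names below are placeholders, not actual dump names.

    import lzma
    import shutil

    # Stream a dump through an xz preset-3 compressor, 1 MiB at a time.
    with open("pages-meta-history.xml", "rb") as src, \
            lzma.open("pages-meta-history.xml.xz", "wb", preset=3) as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)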
Trying to get quick-and-dirty long-range matching into LZMA isn't feasible for me personally and there may be inherent technical difficulties. Still, I left a note on the 7-Zip boards as folks suggested; feel free to add anything there: https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall