Ack, sorry for the (no subject); again in the right thread:
For external uses like XML dumps, integrating the compression strategy into LZMA would, however, be very attractive. This would also benefit other users of LZMA compression, like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That uses a 4 MB buffer, gets compression ratios within 15-25% of current 7zip (or histzip), and runs at 30 MB/s on my box, which is still 8x faster than the status quo (going by a 1 GB benchmark).
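If it helps, here's a rough Python sketch of roughly those settings via the standard lzma module (which wraps liblzma, the xz library); the input filename is just a placeholder, not a real dump path:

import lzma
import shutil

# Stream-compress a file with LZMA preset 3 (the same level as "xz -3",
# which uses a 4 MiB dictionary). IN_PATH is a hypothetical example file.
IN_PATH = "pages-meta-history.xml"
OUT_PATH = IN_PATH + ".xz"

with open(IN_PATH, "rb") as src, lzma.open(OUT_PATH, "wb", preset=3) as dst:
    shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy in 1 MiB chunks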
Trying to get quick-and-dirty long-range matching into LZMA isn't feasible for me personally, and there may be inherent technical difficulties. Still, I left a note on the 7-Zip boards as folks suggested; feel free to add anything there: https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall
If you're going to use xz then you wouldn't even have to recompress the blocks that haven't changed and are already well compressed.
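Something like this sketch (the chunk boundaries and cache are placeholder assumptions, not how the dump tooling actually works): compress each chunk as its own .xz stream and concatenate them; concatenated .xz streams decompress as a single file, so only changed chunks would ever need recompressing.

import hashlib
import lzma

def build_dump(chunks, cache):
    """Concatenate per-chunk .xz streams, reusing cached output for unchanged chunks.

    'chunks' is an iterable of byte strings (e.g. one block of revisions each);
    'cache' maps a chunk digest to its previously compressed bytes.
    Concatenated .xz streams decompress as one file with plain "xz -d".
    """
    out = []
    for chunk in chunks:
        key = hashlib.sha256(chunk).hexdigest()
        if key not in cache:
            cache[key] = lzma.compress(chunk, preset=3)  # only new/changed chunks are compressed
        out.append(cache[key])
    return b"".join(out)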
One of the things I can't understand is why we are extracting summaries of pages for Yahoo. Is it our job to do it? The dumps are really huge, e.g. for Wikidata the abstract dump (http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-abstract.xml) is 14.1 GB. Compare it to the full history dump (http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-pages-meta-history.xml.bz2) at 8.8 GB.
So why are we doing this?
Best
On Wed, Jan 22, 2014 at 4:10 AM, Anthony ok@theendput.com wrote:
If you're going to use xz then you wouldn't even have to recompress the blocks that haven't changed and are already well compressed.
On 01/21/2014 09:47 PM, Amir Ladsgroup wrote:
One of the things I can't understand is why we are extracting summaries of pages for Yahoo. Is it our job to do it? The dumps are really huge, e.g. for Wikidata the abstract dump wikidatawiki-20140106-abstract.xml is 14.1 GB, compared to the full history wikidatawiki-20140106-pages-meta-history.xml.bz2 at 8.8 GB.
That's because the Yahoo one isn't compressed.
I'm not sure if Yahoo still uses those abstracts, but I wouldn't be surprised at all if other people are.
Matt Flaschen
On Wed, Jan 22, 2014 at 10:31 AM, Matthew Flaschen mflaschen@wikimedia.org wrote:
That's because the Yahoo one isn't compressed.
Why? Can we make it compressed? It's really annoying to see that huge file there for almost no reason.
On 01/21/2014 11:10 PM, Amir Ladsgroup wrote:
Why? Can we make it compressed? It's really annoying to see that huge file there for almost no reason.
It's probably because it's relatively small on major wikis (e.g. English Wikipedia's is 3.8 GB). However, I see no reason not to compress it, especially when it's larger (like Wikidata's).
Matt Flaschen