On Thu, Dec 16, 2010 at 12:47 AM, Andrew Dunbar <hippytrail@gmail.com> wrote:
At the moment I'm interested in .bz2 and .7z because those are the formats Wikimedia currently publishes data in.
I'm fairly certain the specific 7z format which Wikimedia uses doesn't allow for random access, because the dictionary is never reset.
Have we made the case for this format to the Wikimedia people?
No, there's no off-the-shelf tool to create these files - the standard .xz file created by xz utils puts everything in one stream, which is basically equivalent to the .7z files already being made. I'm sure "patches are welcome", but I don't have the time to create the patch.
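To make the single-stream point concrete, here is a minimal sketch in Python of the alternative: compress fixed-size chunks as independent .xz streams, concatenate them, and keep a side index so a reader can decompress only the chunks it needs. This is an illustration under stated assumptions, not the missing tool. The names compress_chunked, read_range and CHUNK_SIZE are hypothetical, and the index is kept outside the file rather than in any standard container.

```python
import lzma

CHUNK_SIZE = 1 << 20  # hypothetical choice: 1 MiB of uncompressed data per stream


def compress_chunked(data, chunk_size=CHUNK_SIZE):
    """Compress data as concatenated, independent .xz streams.

    Returns (blob, index). The index holds one entry per stream:
    (uncompressed_offset, compressed_offset, compressed_length).
    """
    parts = []
    index = []
    compressed_offset = 0
    for start in range(0, len(data), chunk_size):
        # Each call emits a complete .xz stream, so the dictionary starts
        # fresh and the stream can be decoded without its predecessors.
        stream = lzma.compress(data[start:start + chunk_size],
                               format=lzma.FORMAT_XZ)
        index.append((start, compressed_offset, len(stream)))
        parts.append(stream)
        compressed_offset += len(stream)
    return b"".join(parts), index


def read_range(blob, index, offset, length, chunk_size=CHUNK_SIZE):
    """Return data[offset:offset + length], decoding only overlapping streams."""
    out = bytearray()
    end = offset + length
    for start, c_off, c_len in index:
        if start + chunk_size <= offset or start >= end:
            continue  # this stream does not overlap the requested range
        chunk = lzma.decompress(blob[c_off:c_off + c_len])
        out += chunk[max(offset - start, 0):end - start]
    return bytes(out)
```

The concatenated output is still a valid .xz file, so ordinary decompressors can read it from start to finish. The cost is a somewhat worse compression ratio, since each chunk starts with an empty dictionary.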
How is .xz for compression times?
At the default settings, it's quite slow. I believe it's pretty much the same as 7zip with its default settings. The main reason I was using xz instead of 7zip is that xz handles pipes better - specifically, 7zip doesn't allow you to pipe from stdin to stdout. (See https://bugs.launchpad.net/ubuntu/+source/p7zip/+bug/383667 and the response - "You should use lzma." - well, lzma utils has been replaced by xz utils.)
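For anyone who wants the same stdin-to-stdout behaviour from a script, a rough Python equivalent using the standard lzma module might look like the following. The function name xz_pipe and the buffer size are my own choices, not part of xz utils.

```python
import lzma


def xz_pipe(src, dst, bufsize=64 * 1024):
    """Stream-compress binary file object src into dst as a single .xz stream."""
    compressor = lzma.LZMACompressor(format=lzma.FORMAT_XZ)
    while True:
        block = src.read(bufsize)
        if not block:
            break
        # compress() may return an empty bytes object while it buffers input.
        dst.write(compressor.compress(block))
    # flush() emits the remaining data and the stream footer.
    dst.write(compressor.flush())
```

Calling xz_pipe(sys.stdin.buffer, sys.stdout.buffer) after importing sys makes the script usable in the middle of a shell pipeline, which is the behaviour the bug report above asks of 7zip.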
For decompression, .xz is generally faster than .bz2 and slower than .gz.
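If you want to check this on your own data, a quick sketch along these lines would do. It uses Python's standard gzip, bz2 and lzma modules at their default settings rather than the command-line tools, so the absolute numbers will differ from what gzip, bzip2 and xz report. The function name time_decompression is hypothetical.

```python
import bz2
import gzip
import lzma
import time


def time_decompression(data, repeats=3):
    """Return {extension: (compressed_size, best_decompress_seconds)}."""
    codecs = {
        ".gz": (gzip.compress, gzip.decompress),
        ".bz2": (bz2.compress, bz2.decompress),
        ".xz": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        blob = compress(data)
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            restored = decompress(blob)
            best = min(best, time.perf_counter() - t0)
        # Sanity check: the round trip must reproduce the input exactly.
        assert restored == data
        results[name] = (len(blob), best)
    return results
```

Feed it a representative slice of a dump, not synthetic data; both the ratios and the timings depend heavily on the input.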
Would we have to worry about patent issues for LZMA?
No, it uses LZMA2.