Hi Douglas
I will try to answer; tell me if I have misunderstood you.
On 01/01/2013 14:18, Douglas Crosher wrote:
Would it be very limiting on ZIM files if the XZ decoder were restricted to the 'XZ embedded' format, supporting only the 'LZMA2' filter? See: http://tukaani.org/xz/embedded.html
Why would it be? ZIM only supports LZMA2 compression, so this would be perfect. As far as I understand, xz-embedded is only a decompressor with limited features.
Do ZIM files really need the XZ/LZMA2 containers, or could they just use raw LZMA1 compression? This could be added as a new cluster compression type for compatibility.
We cannot change the compression algorithm chosen for the ZIM format.
Two possible uses for XZ/LZMA2 may be for large entries and/or entries with distinct regions that are compressible and not compressible. However perhaps a significant amount of content does not need this.
We recommend compressing only text content; consequently, pictures are usually not compressed in ZIM files. The amount of text compressed in one cluster is chosen by the ZIM creator; at Kiwix it is 1 MB (a size we should maybe reconsider and increase).
I expect that typical HTML entries would be relatively small. It would seem pointless for a cluster to use multiple XZ blocks and/or streams when these could be avoided by placing entries in separate clusters. So perhaps there is a case for clusters with just one LZMA1 block.
I am not sure I understand you correctly, but HTML entries are concatenated *before* being compressed.
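To illustrate why concatenating before compressing matters for small HTML entries, here is a rough sketch in Python using the standard `lzma` module. The entries are made up and `xz_size` is a hypothetical helper; this is not the ZIM writer's code, and a real cluster also needs some way to locate each entry after decompression, which is omitted here.

```python
import lzma

def xz_size(data: bytes) -> int:
    """Return the size of `data` once compressed as a standalone .xz stream."""
    return len(lzma.compress(data, format=lzma.FORMAT_XZ))

# Made-up small HTML entries, standing in for real articles.
entries = [
    (
        f"<html><head><title>Article {i}</title></head>"
        f"<body><p>Text of article {i}.</p></body></html>"
    ).encode("utf-8")
    for i in range(200)
]

# Option A: every entry compressed on its own.
separate = sum(xz_size(entry) for entry in entries)

# Option B: entries concatenated first, then compressed once as a cluster.
cluster = b"".join(entries)
together = xz_size(cluster)

print("compressed separately:", separate, "bytes")
print("compressed as one cluster:", together, "bytes")
```

With many small, similar entries the single cluster should come out much smaller, because the compressor can reuse what the entries have in common and the per-stream overhead is paid only once.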
Further entries are likely to either be compressible or not, and could be placed in separate clusters rather than exploiting the LZMA2 support for such content.
That is already the case.
It might even save space not having the XZ container overhead.
As far as I know, the streams are LZMA2-encoded and do not use the XZ format.
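For a sense of how much the XZ container would add on top of a bare LZMA2 stream, the following sketch compresses the same buffer both ways with Python's `lzma` module. The sample data and the preset are assumptions chosen for illustration, and nothing here is taken from the ZIM code itself.

```python
import lzma

# Same LZMA2 filter chain for both outputs; preset 6 is an arbitrary choice.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]

# Made-up, text-like sample data (about 64 KiB).
data = b"<p>Some repetitive article text for the test.</p>\n" * 1300

# LZMA2 stream wrapped in the .xz container (headers, index, integrity check).
in_xz = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)

# The same filter chain with no container at all.
raw = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)

overhead = len(in_xz) - len(raw)
print("xz container:", len(in_xz), "bytes")
print("raw LZMA2:   ", len(raw), "bytes")
print("overhead:    ", overhead, "bytes")

# A raw stream carries no header, so the decoder must be told the filters.
restored = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=filters)
```

The container overhead is a fixed cost per stream, so it matters less as clusters get larger. The raw variant also shows the trade-off: the reader has to know the filter settings in advance, since nothing in the stream describes them.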
Regards Emmanuel