Hi Emmanuel,
Thank you for the explanation.
On 01/04/2013 04:32 AM, Emmanuel Engelhart wrote:
Hi Douglas
I will try to answer; tell me if I have misunderstood you.
On 01/01/2013 14:18, Douglas Crosher wrote:
Would it be very limiting on ZIM files if the XZ decoder were restricted to the 'XZ embedded' format, supporting only the 'LZMA2' filter? See: http://tukaani.org/xz/embedded.html
Why? ZIM only supports LZMA2 compression, so this would be perfect. As far as I understand, xz-embedded is only a decompressor with limited features.
Great, then this looks good.
Do ZIM files really need the XZ/LZMA2 containers, or could they just use raw LZMA1 compression? This could be added as a new cluster compression type for compatibility.
We cannot change the chosen compression algorithm for the ZIM format.
The ZIM file format does have provision for new cluster compression formats, and it would appear practical to add a new format and deprecate an old one.
Two possible uses for XZ/LZMA2 are large entries and entries that mix compressible and incompressible regions. However, perhaps a significant amount of content does not need this.
We recommend compressing only text content; consequently, pictures are usually not compressed in ZIM files. The amount of text compressed in one cluster is chosen by the ZIM creator; at Kiwix it is 1 MB (a size we should maybe reconsider and increase).
Increasing the cluster size would hurt slow devices and devices with limited memory. It would be interesting to know the potential reduction in compressed size though.
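One rough way to estimate the potential reduction is to compress the same text at several cluster sizes and compare the totals. The sketch below is only illustrative: it uses Python's standard lzma module with default settings, and synthetic text from a hypothetical sample_text helper rather than real ZIM content, so real figures will differ.

```python
import lzma
import random


def sample_text(n_bytes: int, seed: int = 0) -> bytes:
    """Build synthetic word-based text (a stand-in for concatenated HTML)."""
    rng = random.Random(seed)
    vocab = [f"word{i}" for i in range(500)]  # hypothetical vocabulary
    parts = []
    size = 0
    while size <= n_bytes:
        word = rng.choice(vocab)
        parts.append(word)
        size += len(word) + 1  # +1 for the separating space
    return " ".join(parts).encode("ascii")[:n_bytes]


def compressed_total(data: bytes, cluster_size: int) -> int:
    """Total size when each cluster is compressed as its own XZ stream."""
    return sum(
        len(lzma.compress(data[i:i + cluster_size]))
        for i in range(0, len(data), cluster_size)
    )


if __name__ == "__main__":
    text = sample_text(512 * 1024)
    for kib in (16, 64, 256):
        total = compressed_total(text, kib * 1024)
        print(f"{kib:4d} KiB clusters: {total} bytes")
```

Running the same measurement over real article text, and timing decompression of a single cluster on a slow device, would show both sides of the trade-off.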
I expect that typical HTML entries would be relatively small. It would seem pointless for a cluster to use multiple XZ blocks and/or streams when these could be avoided by placing entries in separate clusters. So perhaps there is a case for clusters with just one LZMA1 block.
I am not sure I understand you correctly, but HTML entries are concatenated *before* being compressed.
LZMA2 has a feature that allows it to insert uncompressed blocks. Since blobs, such as images, are placed in separate uncompressed clusters, this LZMA2 feature is probably not needed.
It might even save space not having the XZ container overhead.
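Both points can be checked with a small experiment. This is a sketch using Python's standard lzma module, not the ZIM tools themselves: it encodes the same data as an XZ container with LZMA2, as raw LZMA2, and as raw LZMA1. The preset and the sample inputs are assumptions chosen for illustration.

```python
import lzma
import random

# Filter chains for container-less ("raw") streams; preset 6 is an assumption.
RAW_LZMA1 = [{"id": lzma.FILTER_LZMA1, "preset": 6}]
RAW_LZMA2 = [{"id": lzma.FILTER_LZMA2, "preset": 6}]


def encoded_sizes(data: bytes) -> dict:
    """Return the encoded size of data in three framings."""
    return {
        "xz+lzma2": len(lzma.compress(data, format=lzma.FORMAT_XZ)),
        "raw lzma2": len(
            lzma.compress(data, format=lzma.FORMAT_RAW, filters=RAW_LZMA2)
        ),
        "raw lzma1": len(
            lzma.compress(data, format=lzma.FORMAT_RAW, filters=RAW_LZMA1)
        ),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Compressible input: repetitive HTML-like text.
    text = b"<p>Some repetitive HTML-like text.</p>\n" * 2000
    # Incompressible input: pseudo-random bytes, standing in for an image.
    noise = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
    for name, data in (("text", text), ("noise", noise)):
        print(name, len(data), encoded_sizes(data))
```

On the text input, the difference between the XZ and raw sizes is the per-stream container overhead. On the incompressible input, raw LZMA2 should stay within a few bytes of the input size because it can store chunks uncompressed, while raw LZMA1 expands somewhat more.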
As far as I know, streams are LZMA2-encoded and do not use the XZ format.
They do appear to use the XZ container, and this is documented in the ZIM file format specification at: http://openzim.org/index.php/ZIM_File_Format
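A quick way to confirm this on a real file is to look for the six-byte XZ magic at the start of a compressed cluster's payload. The sketch below assumes the payload is the cluster data following the one-byte compression-type field described in the specification; starts_with_xz_container is a hypothetical helper name, and the demonstration uses streams produced by Python's lzma module rather than bytes read from a ZIM file.

```python
import lzma

# Magic bytes at the start of every XZ stream: FD 37 7A 58 5A 00.
XZ_MAGIC = b"\xfd7zXZ\x00"


def starts_with_xz_container(payload: bytes) -> bool:
    """Return True if the payload begins with the XZ stream header magic."""
    return payload[:len(XZ_MAGIC)] == XZ_MAGIC


if __name__ == "__main__":
    data = b"<html><body>example entry</body></html>" * 100
    in_container = lzma.compress(data, format=lzma.FORMAT_XZ)
    raw_lzma2 = lzma.compress(
        data,
        format=lzma.FORMAT_RAW,
        filters=[{"id": lzma.FILTER_LZMA2, "preset": 6}],
    )
    print("XZ container:", starts_with_xz_container(in_container))
    print("raw LZMA2:   ", starts_with_xz_container(raw_lzma2))
```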
Regards,
Douglas Crosher