Hi Douglas,
I will try to answer your questions; tell me if I have misunderstood you somewhere.
On 01/01/2013 14:18, Douglas Crosher wrote:
Would it be very limiting on ZIM files if the XZ decoder were restricted
to the 'XZ embedded' format, supporting only the 'LZMA2' filter? See:
http://tukaani.org/xz/embedded.html
Why would it be limiting? ZIM only supports LZMA2 compression, so this
would be perfect. As far as I understand, xz-embedded is only a
decompressor with a limited feature set.
Do ZIM files really need the XZ/LZMA2 containers, or could they just use
raw LZMA1 compression? This could be added as a new cluster compression
type for compatibility.
We cannot change the compression algorithm chosen for the ZIM format.
Two possible uses for XZ/LZMA2 may be for large entries and/or entries
with distinct regions that are compressible and not compressible.
However perhaps a significant amount of content does not need this.
We recommend compressing only text content, so pictures are usually not
compressed in ZIM files. The amount of text compressed in one cluster is
chosen by the ZIM creator; at Kiwix it is 1 MB (a size we should perhaps
reconsider and increase).
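
To illustrate the grouping described here, this is a minimal Python sketch, not the actual ZIM writer code. The names build_clusters, compress_cluster and CLUSTER_TARGET are hypothetical, and the real cluster layout (headers, offset tables) is not modelled; it only shows text entries being collected up to a target size and concatenated before a single compression pass.

```python
import lzma

# Assumption: 1 MB target, matching the Kiwix value mentioned in this mail.
CLUSTER_TARGET = 1024 * 1024


def build_clusters(entries, target=CLUSTER_TARGET):
    """Group text entries (bytes) into clusters of roughly `target` bytes."""
    clusters, current, size = [], [], 0
    for entry in entries:
        # Start a new cluster when adding this entry would exceed the target.
        if current and size + len(entry) > target:
            clusters.append(current)
            current, size = [], 0
        current.append(entry)
        size += len(entry)
    if current:
        clusters.append(current)
    return clusters


def compress_cluster(cluster):
    """Concatenate the entries first, then compress them as one unit."""
    return lzma.compress(b"".join(cluster), format=lzma.FORMAT_XZ)
```

Because the entries share one compressed stream, many small HTML pages benefit from a common dictionary, which is the reason a larger cluster size may compress better.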
I expect that typical HTML entries would be relatively small. It would
seem pointless for a cluster to use multiple XZ blocks and/or streams
when these could be avoided by placing entries in separate clusters. So
perhaps there is a case for clusters with just one LZMA1 block.
I am not sure I understand you correctly, but HTML entries are
concatenated *before* being compressed.
Further, entries are likely to either be compressible or not, and could be placed
in separate clusters rather than exploiting the LZMA2 support for such
content.
That is already the case.
It might even save space not having the XZ container overhead.
As far as I know, the streams are LZMA2 encoded and do not use the XZ format.
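
For anyone who wants to put a number on the container overhead in question, here is a small sketch using Python's standard lzma module. The sample data and the preset are arbitrary assumptions, and it says nothing about what a given ZIM writer actually emits; it only compares the same LZMA2 payload with and without the XZ container.

```python
import lzma

# Assumption: arbitrary sample text standing in for a cluster of HTML entries.
data = b"<p>Hello ZIM</p>" * 1000

# Same LZMA2 filter chain for both encodings, so only the container differs.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]

xz = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)    # .xz container
raw = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)  # bare LZMA2

# Bytes added by the XZ stream header, block header, integrity check,
# index and footer.
overhead = len(xz) - len(raw)
print(f"xz: {len(xz)} bytes, raw LZMA2: {len(raw)} bytes, overhead: {overhead}")
```

The overhead is a fixed cost per stream, so it matters less as the cluster size grows.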
Regards
Emmanuel