On 01/04/2013 04:49 AM, Tommi Mäkitalo wrote:
Hi,
there should be only one compression algorithm. Otherwise a reader must be able to handle every supported algorithm. What is the point of a standard format if some readers can only read part of the files?
A Javascript decoder is slow enough that optimizing the containers used for the supported compression algorithm might be warranted.
It might only make a performance difference: a new container format might simply load faster on average.
The zimwriter makes clusters of 1MB of HTML files and compresses them with lzma2. Actually no xz overhead is used here. The 1MB cluster size is chosen because lzma2 uses it. Larger clusters do not increase the compression ratio at all.
The clusters are compressed using the XZ container, which holds streams and then blocks of LZMA2; the LZMA2 container in turn uses chunks of either LZMA-compressed data or uncompressed data. There may be some unnecessary baggage here, and these containers may not be optimal for the ZIM format. If the decoding time could on average be halved by changing the containers, that might warrant consideration.
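To make the layering concrete, here is a rough Python sketch using the standard lzma module (not the actual zimwriter/zimlib code) of the difference between shipping a cluster in the full XZ container versus as a bare LZMA2 stream whose filter settings are fixed by the file format:

    import lzma

    # Assumed filter chain: one LZMA2 filter with a 1 MB dictionary,
    # matching the ~1 MB cluster size discussed above.
    FILTERS = [{"id": lzma.FILTER_LZMA2, "dict_size": 1 << 20}]

    def compress_cluster_xz(cluster):
        # Full XZ container: stream header/footer, block headers, index
        # and checksums are all carried in the output.
        return lzma.compress(cluster, format=lzma.FORMAT_XZ, filters=FILTERS)

    def compress_cluster_raw(cluster):
        # Bare LZMA2 chunks with no XZ framing; the reader must know the
        # filter chain out of band, e.g. because the file format fixes it.
        return lzma.compress(cluster, format=lzma.FORMAT_RAW, filters=FILTERS)

    def decompress_cluster_raw(blob):
        return lzma.decompress(blob, format=lzma.FORMAT_RAW, filters=FILTERS)

A decoder fixed to one known filter chain can skip the container parsing entirely; whether that actually halves decoding time in Javascript would of course have to be measured.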
The writer has a fixed list of mime types which are not compressed. The mime types are "image/jpeg", "image/png", "image/tiff", "image/gif" and "application/zip". The writer does not try to compress them further; they are stored as-is in a separate cluster.
For this reason the LZMA2 container may be redundant. LZMA2 added support for uncompressed chunks, but since most of the incompressible blobs are placed in separate clusters, this extra LZMA2 support may just be baggage. I note that having all the images in non-compressed clusters will help make a Javascript port more practical, as it means there will be fewer clusters to decode for a typical page.
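As a hypothetical sketch of the routing described above (plain Python for illustration, not zimwriter code; the cluster objects here are just lists, not zimlib API):

    # Mime types the writer leaves uncompressed, as listed above.
    NO_COMPRESS = {"image/jpeg", "image/png", "image/tiff",
                   "image/gif", "application/zip"}

    def route_blob(mimetype, blob, lzma_cluster, plain_cluster):
        # Already-compressed formats go to an uncompressed cluster and are
        # stored as-is; everything else goes to a cluster that will be
        # LZMA-compressed as a whole.
        if mimetype in NO_COMPRESS:
            plain_cluster.append(blob)
        else:
            lzma_cluster.append(blob)

A reader then only ever runs the decompressor on the text clusters, which is what keeps the number of clusters to decode per page small.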
Regards,
Douglas Crosher