Hi,
Has anyone considered a pure Javascript ZIM file reader and Wikipedia reader?
I have made a small start, writing some hack code to open a ZIM file and it gets to the point of needing to uncompress a cluster. A start has been made on the needed XZ decompress code but it's not done yet.
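For anyone curious, the header parsing so far amounts to something like the sketch below (a rough cut, with the field offsets as I read them from the spec at http://openzim.org/index.php/ZIM_File_Format; readUint64 is exact because Javascript numbers hold integers up to 2^53):

  function readZimHeader(buffer) {  // buffer: ArrayBuffer with the first 80 bytes
    var v = new DataView(buffer);
    if (v.getUint32(0, true) !== 72173914)  // ZIM magic number, little-endian
      throw new Error('Not a ZIM file');
    return {
      version:       v.getUint32(4, true),
      // bytes 8-23 hold the uuid
      articleCount:  v.getUint32(24, true),
      clusterCount:  v.getUint32(28, true),
      urlPtrPos:     readUint64(v, 32),
      titlePtrPos:   readUint64(v, 40),
      clusterPtrPos: readUint64(v, 48),
      mimeListPos:   readUint64(v, 56),
      mainPage:      v.getUint32(64, true),
      layoutPage:    v.getUint32(68, true),
      checksumPos:   readUint64(v, 72)
    };
  }

  function readUint64(view, pos) {
    return view.getUint32(pos, true) + view.getUint32(pos + 4, true) * 4294967296;
  }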
If practical then this could provide a portable offline reader that could suit any device with a modern web browser, and provide a reader for Android tablets, and Firefox OS, etc. Perhaps even node.js servers.
If someone has already tried this and found the performance to be too poor, please let me know.
Btw: has adding thumbnail image support to the ZIM file format been considered? This would be very handy for a ZIM file selector in a tablet or phone UI.
Regards Douglas Crosher
Hi Douglas
On 01/01/2013 02:22 AM, Douglas Crosher wrote:
Has anyone considered a pure Javascript ZIM file reader and Wikipedia reader?
No, this is complicated to do... although it could be practical. I'm also not sure whether we could get acceptable performance.
I have made a small start, writing some hack code to open a ZIM file and it gets to the point of needing to uncompress a cluster. A start has been made on the needed XZ decompress code but it's not done yet.
Great. Yes, xz decompression is the most complicated part.
If practical then this could provide a portable offline reader that could suit any device with a modern web browser, and provide a reader for Android tablets, and Firefox OS, etc. Perhaps even node.js servers.
Yes, this could be pretty disruptive IMO.
If someone has already tried this and found the performance to be too poor, please let me know.
Btw: has adding thumbnail image support to the ZIM file format been considered? This would be very handy for a ZIM file selector in a tablet or phone UI.
ZIM is able to store anything, including pictures. Most ZIM files come with thumbnails.
Best of luck with the next steps on your js ZIM library... please keep us informed about your progress.
Kind regards Emmanuel
On Tue, Jan 1, 2013 at 9:09 AM, Emmanuel Engelhart kelson@kiwix.org wrote:
Btw: has adding thumbnail image support to the ZIM file format been considered? This would be very handy for a ZIM file selector in a tablet or phone UI.
ZIM is able to store anything, including pictures. Most ZIM files come with thumbnails.
I think Douglas wanted to know if it is possible to add a picture to the metadata of the ZIM itself, like the name and author, so that a ZIM collection library can be browsed with pictures only (on a phone). Douglas, the answer is yes; almost all (if not all) ZIM files produced by Kiwix include a thumbnail.
renaud
On 01/01/2013 11:03 PM, renaud gaudin wrote:
On Tue, Jan 1, 2013 at 9:09 AM, Emmanuel Engelhart <kelson@kiwix.org> wrote:
Btw: has adding thumbnail image support to the ZIM file format been considered? This would be very handy for a ZIM file selector in a tablet or phone UI.
ZIM is able to store anything, including pictures. Most ZIM files come with thumbnails.
I think Douglas wanted to know if it is possible to add a picture to the metadata of the ZIM itself, like the name and author, so that a ZIM collection library can be browsed with pictures only (on a phone). Douglas, the answer is yes; almost all (if not all) ZIM files produced by Kiwix include a thumbnail.
Yes, this is the question I had in mind, and I now see it in the metadata section of the ZIM file format specification: 'A favicon (48x48) is also mandatory and should be located at /-/favicon'.
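As an aside, once the directory entries can be parsed, fetching /-/favicon for a library view should amount to a binary search over the URL pointer list, since entries are ordered by namespace and then URL. A rough sketch, where readDirent(i) stands in for the directory-entry parser I have not written yet:

  function findByUrl(header, readDirent, ns, url) {
    var lo = 0, hi = header.articleCount - 1;
    while (lo <= hi) {
      var mid = (lo + hi) >> 1;
      var d = readDirent(mid);  // assumed to return {ns: ..., url: ..., ...}
      if (d.ns === ns && d.url === url) return d;
      if (d.ns < ns || (d.ns === ns && d.url < url)) lo = mid + 1;
      else hi = mid - 1;
    }
    return null;  // e.g. findByUrl(header, readDirent, '-', 'favicon')
  }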
Thanks, Douglas
On 01/01/2013 08:09 PM, Emmanuel Engelhart wrote:
Hi Douglas
On 01/01/2013 02:22 AM, Douglas Crosher wrote:
Has anyone considered a pure Javascript ZIM file reader and Wikipedia reader?
No, this is complicated to do... although it could be practical. I'm also not sure whether we could get acceptable performance.
I'll hack something together to explore the performance question, and follow up.
I have made a small start, writing some hack code to open a ZIM file and it gets to the point of needing to uncompress a cluster. A start has been made on the needed XZ decompress code but it's not done yet.
Great. Yes, xz decompression is the most complicated part.
Would it be very limiting on ZIM files if the XZ decoder were restricted to the 'XZ embedded' format, supporting only the 'LZMA2' filter? See: http://tukaani.org/xz/embedded.html
Do ZIM files really need the XZ/LZMA2 containers, or could they just use raw LZMA1 compression? This could be added as a new cluster compression type for compatibility.
Two possible uses for XZ/LZMA2 may be for large entries and/or entries with distinct compressible and incompressible regions. However, perhaps a significant amount of content does not need this.
I expect that typical HTML entries would be relatively small. It would seem pointless for a cluster to use multiple XZ blocks and/or streams when these could be avoided by placing entries in separate clusters, so perhaps there is a case for clusters with just one LZMA1 block. Further, entries are likely to either be compressible or not, and could be placed in separate clusters rather than exploiting the LZMA2 support for such content.
It might even save space not having the XZ container overhead.
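(Rough arithmetic from the XZ file-format spec, to put a number on the overhead: a 12-byte stream header, 12-byte stream footer, a block header of at least 8 bytes, a 4-byte CRC32 check and a minimal index come to somewhere around 50 bytes per cluster. That is negligible for megabyte-sized clusters, so the real saving would be decoder complexity rather than space.)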
Regards Douglas Crosher
Douglas,
There are multiple JS LZMA libraries. I haven't looked at any of them, but have you? It might be enough for you to get a sense of the performance.
renaud
Hi Renaud,
Here are some I have seen:
- LZMA-JS
https://github.com/nmrugg/LZMA-JS/
Online demo: http://nmrugg.github.com/LZMA-JS/demos/advanced_demo.html
- js-lzma
http://code.google.com/p/js-lzma/
I have not seen any LZMA2 or XZ Javascript libraries. I've written much of the XZ container support and am looking at the LZMA2 support, but there seems to be a lot of unnecessary baggage in the XZ/LZMA2 containers.
First tests suggest LZMA decompression in JS is very slow, but I would not want to draw any conclusions until analysing it further.
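For reference, the "first tests" are nothing more elaborate than a loop like this around whichever decompress entry point is under test, adapted to run synchronously:

  function benchDecode(decode, compressed, reps) {
    var out, t0 = Date.now();
    for (var i = 0; i < reps; i++)
      out = decode(compressed);  // the decoder under test, e.g. js-lzma wrapped
    var seconds = (Date.now() - t0) / 1000;
    return (out.length * reps) / (1024 * 1024) / seconds;  // MB/s of decoded output
  }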
If raw LZMA1 were an option then it could also be useful to include the CRC32 for each cluster blob, so that the decompressor could exit as soon as the blob was decoded rather than having to decompress the entire cluster to verify the XZ checks. The CRC32 could be included after each blob offset. If the JS decompressor is slow then such optimizations might be essential, and might help battery life on mobile devices.
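The CRC32 itself is cheap; the standard table-driven version (same polynomial as the XZ CRC32 check type) is only a few lines:

  var CRC_TABLE = (function () {
    var t = new Uint32Array(256), c, n, k;
    for (n = 0; n < 256; n++) {
      for (c = n, k = 0; k < 8; k++)
        c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
      t[n] = c;
    }
    return t;
  })();

  function crc32(bytes) {  // bytes: Uint8Array
    var c = 0xFFFFFFFF;
    for (var i = 0; i < bytes.length; i++)
      c = CRC_TABLE[(c ^ bytes[i]) & 0xFF] ^ (c >>> 8);
    return (c ^ 0xFFFFFFFF) >>> 0;
  }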
Regards Douglas Crosher
On 01/01/2013 15:09, Douglas Crosher wrote:
If raw LZMA1 were an option then it could also be useful to include the CRC32 for each cluster blob, so that the decompressor could exit as soon as the blob was decoded rather than having to decompress the entire cluster to verify the XZ checks. The CRC32 could be included after each blob offset. If the JS decompressor is slow then such optimizations might be essential, and might help battery life on mobile devices.
Yes, this is interesting... but LZMA is not an option. Would it be impossible to do this with LZMA2?
Emmanuel
On 01.01.2013 15:09, Douglas Crosher wrote:
Hi Renaud,
Here are some I have seen:
- LZMA-JS
https://github.com/nmrugg/LZMA-JS/
Online demo: http://nmrugg.github.com/LZMA-JS/demos/advanced_demo.html
- js-lzma
http://code.google.com/p/js-lzma/
...
I wonder if these JS libraries actually support LZMA2. Both claim to be a port of the Java SDK, which is a port of the LZMA SDK, which does support LZMA2.
On 01/04/2013 04:55 AM, Tommi Mäkitalo wrote:
On 01.01.2013 15:09, Douglas Crosher wrote:
Hi Renaud,
Here are some I have seen:
- LZMA-JS
https://github.com/nmrugg/LZMA-JS/
Online demo: http://nmrugg.github.com/LZMA-JS/demos/advanced_demo.html
- js-lzma
...
I wonder if these JS libraries actually support LZMA2. Both claim to be a port of the Java SDK, which is a port of the LZMA SDK, which does support LZMA2.
They definitely only support LZMA, not LZMA2. LZMA2 is not that much more difficult and can easily be written. Trying to optimize the decoder for Javascript is taking much more time than adding LZMA2 support.
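To illustrate why: LZMA2 is just a thin chunk framing around LZMA1 (layout as described in the LZMA SDK documentation). A sketch of the chunk walk, where lzmaDecodeChunk stands for the existing LZMA1 decoder, which is where all the real work happens:

  function lzma2Decode(inp, out) {  // inp, out: Uint8Array
    var p = 0, outPos = 0, props = -1;
    for (;;) {
      var ctrl = inp[p++];
      if (ctrl === 0) return out.subarray(0, outPos);  // end of stream
      if (ctrl < 0x80) {  // 1 or 2: uncompressed chunk (dict reset flag ignored here)
        var size = ((inp[p] << 8) | inp[p + 1]) + 1;
        p += 2;
        out.set(inp.subarray(p, p + size), outPos);
        p += size; outPos += size;
      } else {  // LZMA chunk
        var unpacked = (((ctrl & 0x1F) << 16) | (inp[p] << 8) | inp[p + 1]) + 1;
        var packed = ((inp[p + 2] << 8) | inp[p + 3]) + 1;
        var reset = (ctrl >> 5) & 3;  // 0..3: state/props/dictionary resets
        p += 4;
        if (reset >= 2) props = inp[p++];  // new lc/lp/pb properties byte
        lzmaDecodeChunk(inp.subarray(p, p + packed), out, outPos,
                        unpacked, props, reset);  // the LZMA1 decoder proper
        p += packed; outPos += unpacked;
      }
    }
  }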
Regards Douglas Crosher
Hi Douglas
I will try to answer you; tell me if I have somehow misunderstood you.
On 01/01/2013 14:18, Douglas Crosher wrote:
Would it be very limiting on ZIM files if the XZ decoder were restricted to the 'XZ embedded' format, supporting only the 'LZMA2' filter? See: http://tukaani.org/xz/embedded.html
Why would that be limiting? ZIM only supports LZMA2 compression, so this would be perfect. As far as I understand, xz-embedded is just a decompressor with limited features.
Do ZIM files really need the XZ/LZMA2 containers, or could they just use raw LZMA1 compression? This could be added as a new cluster compression type for compatibility.
We cannot change the chosen compression algorithm for the ZIM format.
Two possible uses for XZ/LZMA2 may be for large entries and/or entries with distinct compressible and incompressible regions. However, perhaps a significant amount of content does not need this.
We recommend compressing only text content, so pictures are usually not compressed in ZIM files. The amount of text compressed into one cluster is chosen by the ZIM creator; at Kiwix it's 1MB (a size we should maybe reconsider and increase).
I expect that typical HTML entries would be relatively small. It would seem pointless for a cluster to use multiple XZ blocks and/or streams when these could be avoided by placing entries in separate clusters, so perhaps there is a case for clusters with just one LZMA1 block.
I'm not sure I understand you correctly, but HTML entries are concatenated *before* being compressed.
Further, entries are likely to either be compressible or not, and could be placed in separate clusters rather than exploiting the LZMA2 support for such content.
That is the case.
It might even save space not having the XZ container overhead.
As far as I know, streams are LZMA2-encoded and do not use the XZ format.
Regards Emmanuel
Hi,
there should be only one compression algorithm. Otherwise a reader must be able to handle every supported algorithm. What is the point of having a standard format if some readers can read only part of the files?
The zimwriter makes clusters of 1MB of HTML files and compresses them with LZMA2. Actually no xz overhead is used here. The 1MB cluster size was chosen because LZMA2 uses it; larger clusters do not increase the compression ratio at all.
The writer has a fixed list of mime types which are not compressed: "image/jpeg", "image/png", "image/tiff", "image/gif" and "application/zip". The writer does not try to compress these further; they are stored as-is in a separate cluster.
Tommi
On 01/04/2013 04:49 AM, Tommi Mäkitalo wrote:
Hi,
there should be only one compression algorithm. Otherwise a reader must be able to handle every supported algorithm. What is the point of having a standard format if some readers can read only part of the files?
A Javascript decoder is so slow that optimizing the containers used by the supported compression algorithm might be warranted.
The change might only make a performance difference: a new container format might simply load faster on average.
The zimwriter makes clusters of 1MB of HTML files and compresses them with LZMA2. Actually no xz overhead is used here. The 1MB cluster size was chosen because LZMA2 uses it; larger clusters do not increase the compression ratio at all.
The clusters are compressed using the XZ container, which has streams and then blocks of LZMA2; the LZMA2 container in turn uses chunks of either LZMA-compressed data or uncompressed data. There may be some unnecessary baggage here, and these containers may not be optimal for the ZIM format. If the decoding time could on average be halved by changing the containers then it might warrant consideration.
The writer has a fixed list of mime types which are not compressed: "image/jpeg", "image/png", "image/tiff", "image/gif" and "application/zip". The writer does not try to compress these further; they are stored as-is in a separate cluster.
For this reason the LZMA2 container may be redundant. LZMA2 added support for uncompressed chunks, but since most incompressible blobs are placed in separate clusters this extra LZMA2 support may just be baggage. I note that having all the images in non-compressed clusters will help make a Javascript port more practical, as it means there will be fewer clusters to decode for a typical page.
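It also means a blob in a non-compressed cluster can be sliced out directly. As I read the spec, the cluster body starts with n+1 four-byte offsets for n blobs, relative to the start of the offset list, so extraction is just two reads:

  function getBlob(body, k) {  // body: Uint8Array of the cluster after the compression byte
    var v = new DataView(body.buffer, body.byteOffset, body.byteLength);
    var start = v.getUint32(k * 4, true);      // start of blob k
    var end = v.getUint32((k + 1) * 4, true);  // start of blob k+1 = end of blob k
    return body.subarray(start, end);
  }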
Regards Douglas Crosher
On 01/03/2013 10:29 PM, Douglas Crosher wrote:
The change might only make a performance difference: a new container format might simply load faster on average.
I am positive that LZMA2 is quite suited to JS in terms of performance (or at least no worse than other compressors). When we developed the ZIM file format we had small devices with few resources in mind; the first users of openZIM were small gadgets like the Ben NanoNote. LZMA has the disadvantage that compression is quite expensive, but it achieves a better compression ratio than the alternatives. On the other hand, decompression is quite cheap compared with other algorithms, so it was the perfect choice for these small platforms.
There was a discussion about whether we could go a step further for devices with very limited memory and not decompress and cache a whole cluster when accessing it, but just take the parts we need from the decompression stream and stop reading once we have all the data needed, forgetting everything else. I think that has been implemented in zimlib.
/Manuel
On 01/04/2013 08:45 AM, Manuel Schneider wrote:
On 01/03/2013 10:29 PM, Douglas Crosher wrote:
The change might only make a performance difference: a new container format might simply load faster on average.
I am positive that LZMA2 is quite suited to JS in terms of performance (or at least no worse than other compressors). When we developed the ZIM file format we had small devices with few resources in mind; the first users of openZIM were small gadgets like the Ben NanoNote. LZMA has the disadvantage that compression is quite expensive, but it achieves a better compression ratio than the alternatives. On the other hand, decompression is quite cheap compared with other algorithms, so it was the perfect choice for these small platforms.
There was a discussion about whether we could go a step further for devices with very limited memory and not decompress and cache a whole cluster when accessing it, but just take the parts we need from the decompression stream and stop reading once we have all the data needed, forgetting everything else. I think that has been implemented in zimlib.
I would like to implement this for the Javascript version too, but the containers do not support it well. The XZ CRC is at the end of the block, so in practice the entire cluster needs to be decoded. The XZ container may not be a good match for the ZIM file format; it would be better to have a CRC for each cluster blob.
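To make that concrete: if the CRC32 for blob i were stored right after offset i (a hypothetical layout, not the current format), a streaming decoder could stop at the end of the wanted blob and still verify it. A sketch, where decode.upTo(n) is an assumed streaming API that decodes until at least n output bytes exist and returns the output so far, and crc32 is as sketched earlier:

  function readBlobChecked(decode, k) {
    var head = decode.upTo((k + 2) * 8);  // covers offset/CRC pairs 0..k+1
    var v = new DataView(head.buffer, head.byteOffset, head.byteLength);
    var start = v.getUint32(k * 8, true);      // offset i at 8*i, its CRC32 at 8*i+4
    var crc = v.getUint32(k * 8 + 4, true);
    var end = v.getUint32((k + 1) * 8, true);
    var body = decode.upTo(end);  // stop as soon as blob k is complete
    var blob = body.subarray(start, end);
    if (crc32(blob) !== crc) throw new Error('blob CRC mismatch');
    return blob;
  }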
Regards Douglas Crosher
On 01/04/2013 12:08 AM, Douglas Crosher wrote:
I would like to implement this for the Javascript version too, but the containers do not support it well. The XZ CRC is at the end of the block, so in practice the entire cluster needs to be decoded. The XZ container may not be a good match for the ZIM file format; it would be better to have a CRC for each cluster blob.
we are not using any containers, just the naked LZMA2 compression. The container is the ZIM file, or more precisely the blob in the cluster where we put the data.
/Manuel
On 03/01/2013 18:49, Tommi Mäkitalo wrote:
Actually no xz overhead is used here.
The format specification claims the contrary:
"compressed clusters are indicated by a value of 4 which indicates LZMA2 compression (or more precisely XZ, since there is a XZ header). http://openzim.org/index.php/ZIM_File_Format#Clusters
Who is wrong?
Emmanuel
On 04.01.2013 09:36, Emmanuel Engelhart wrote:
On 03/01/2013 18:49, Tommi Mäkitalo wrote:
Actually no xz overhead is used here.
The format specification claims the contrary:
"compressed clusters are indicated by a value of 4 which indicates LZMA2 compression (or more precisely XZ, since there is a XZ header). http://openzim.org/index.php/ZIM_File_Format#Clusters
Who is wrong?
Emmanuel
I can always say that I'm right, since I wrote both the mail and the documentation ;-). The documentation is right: an XZ header is included in the cluster.
Tommi
Hi Emmanuel,
Thank you for the explanation.
On 01/04/2013 04:32 AM, Emmanuel Engelhart wrote:
Hi Douglas
I will try to answer you; tell me if I have somehow misunderstood you.
On 01/01/2013 14:18, Douglas Crosher wrote:
Would it be very limiting on ZIM files if the XZ decoder were restricted to the 'XZ embedded' format, supporting only the 'LZMA2' filter? See: http://tukaani.org/xz/embedded.html
Why would that be limiting? ZIM only supports LZMA2 compression, so this would be perfect. As far as I understand, xz-embedded is just a decompressor with limited features.
Great, then this looks good.
Do ZIM files really need the XZ/LZMA2 containers, or could they just use raw LZMA1 compression? This could be added as a new cluster compression type for compatibility.
We cannot change the chosen compression algorithm for the ZIM format.
The ZIM file format does have provision for new cluster compression formats, and it would appear practical to add a new format and deprecate an old one.
Two possible uses for XZ/LZMA2 may be for large entries and/or entries with distinct compressible and incompressible regions. However, perhaps a significant amount of content does not need this.
We recommend compressing only text content, so pictures are usually not compressed in ZIM files. The amount of text compressed into one cluster is chosen by the ZIM creator; at Kiwix it's 1MB (a size we should maybe reconsider and increase).
Increasing the cluster size would hurt slow devices and devices with limited memory. It would be interesting to know the potential reduction in compressed size though.
I expect that typical HTML entries would be relatively small. It would seem pointless for a cluster to use multiple XZ blocks and/or streams when these could be avoided by placing entries in separate clusters, so perhaps there is a case for clusters with just one LZMA1 block.
I'm not sure I understand you correctly, but HTML entries are concatenated *before* being compressed.
LZMA2 has a feature that allows it to insert uncompressed chunks. Since blobs such as images are placed in separate uncompressed clusters, this LZMA2 feature is probably not needed.
It might even save space not having the XZ container overhead.
As far as I know, streams are LZMA2-encoded and do not use the XZ format.
They do appear to use the XZ container, and this is documented in the ZIM file format specification at: http://openzim.org/index.php/ZIM_File_Format
Regards Douglas Crosher