Hi Tommi,
Tommi Mäkitalo wrote:
> I would like to inform you that I have reached a major milestone with the new zim format: I have successfully created a zim file and read it with zimDump.
> The changes are:
> - rewritten large parts
> - updated the zim file format
> - redesigned zimwriter
This is very good news, and I thank you for having rewritten the zimwriter in time.
> Let me say a few words about these changes and why I did this.
> - Rewritten large parts:
> Rewriting helped me improve the code quality. With the knowledge I have today and my experience with the zeno file format, it was possible to clean up the library code.
> - Updated the zim file format:
> Since we decided to drop compatibility, I rethought some parts of the zeno file format. The zeno file format did not support clustering of articles to get better compression. I made a minor change and added an offset and size to the directory entry of the article. The offset to the data blob was left in the article entry, but now multiple articles pointed to the same blob. In the new format I added another data structure: the chunk, which is a collection of blobs. We have a pointer list, similar to the directory pointer list, which points to the chunks. An article addresses its blob by chunk number and blob number. Also, redirect entries do not need these pointers at all, so I just skipped them. This saves some bytes for each redirect.
OK, I really need to read your doc on the wiki to better understand your explanation ;)
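If I got it right, the new layout is roughly something like this (a quick sketch of my understanding only; the names are mine, not the real zimlib structures):

    #include <stdint.h>
    #include <vector>

    // Sketch of my understanding of the new format (field names are guesses).
    struct ArticleEntry {
        uint32_t chunkNumber;    // which chunk holds the article's data
        uint32_t blobNumber;     // which blob inside that chunk
        // ... plus namespace, title, url, etc.
    };

    struct RedirectEntry {
        uint32_t redirectIndex;  // index of the target article
        // no chunk/blob fields at all -> a few bytes saved per redirect
    };

    // A chunk is a collection of blobs; a pointer list, similar to the
    // directory pointer list, gives the position of each chunk in the file.
    struct Chunk {
        std::vector< std::vector<char> > blobs;
    };

Is that about right?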
> - Redesigned zimwriter:
> Now the source of articles is abstracted from the generator. Also, the database is not used any more for temporary data. The writer builds the directory entries in memory and uses a temporary file to collect the compressed data. This will improve performance significantly. The caveat is that more RAM is used, but I estimate that we have enough even for very large zim files.
I have tested it; my first impression is that the zimwriter is really faster than before.
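If I follow the new design correctly, the writer now does roughly this (just my mental model of what you describe, not the actual zimwriter code):

    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <vector>

    // Mental model only: directory entries stay in RAM, compressed chunk
    // data is collected in a temporary file, and the final zim file is
    // assembled from both at the end.
    struct DirEntry { std::string url; unsigned chunk; unsigned blob; };

    int main()
    {
        std::vector<DirEntry> directory;                    // kept in memory
        std::ofstream tmp("chunks.tmp", std::ios::binary);  // temp data file

        // for every article: remember where its data will live ...
        DirEntry e; e.url = "Article"; e.chunk = 0; e.blob = 0;
        directory.push_back(e);
        // ... and append its (compressed) chunk to the temporary file
        tmp << "compressed-chunk-bytes";                    // placeholder
        tmp.close();

        // final pass: write the header and the in-memory directory, append
        // the chunks from the temporary file, then delete it.
        std::remove("chunks.tmp");
        return 0;
    }

That would explain both the speed-up (no more database round trips for temporary data) and the extra RAM.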
> The abstraction of the data source gives us the opportunity to implement other sources more easily, e.g. reading data from the file system or from wikipedia dumps without using the database at all.
Great. I currently do that (from the file system) by running a Perl script that creates a DB, so I may now have to invest time to do it directly in C++. In any case this seems to me to be a really good architectural improvement.
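Something like a minimal article-source interface is what I have in mind; a rough sketch (hypothetical names, nothing from the real zimwriter):

    #include <string>

    // Hypothetical sketch of such an abstraction: the generator would only
    // talk to this interface, so a source can be a database, the file
    // system, a wiki dump, ...
    class ArticleSource {
    public:
        virtual ~ArticleSource() {}
        virtual bool next() = 0;               // advance to the next article
        virtual std::string url() const = 0;
        virtual std::string data() const = 0;  // the article content
    };

    // A source that would walk a directory tree instead of querying a DB.
    class FilesystemSource : public ArticleSource {
        std::string root;
    public:
        explicit FilesystemSource(const std::string& rootDir) : root(rootDir) {}
        bool next()              { return false; }  // real code would iterate files under root
        std::string url() const  { return ""; }     // e.g. the path relative to root
        std::string data() const { return ""; }     // e.g. the file's content
    };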
You did not mention the most essential piece of info for me! The zimwriter no longer seems to die on big dumps (at least for me).
> I hope this will motivate you to go on dumping data, so that we can soon start testing.
I will produce big selection ZIM files (30,000 to 50,000 articles with small pictures) in English, French and Spanish by the Linuxtag. A new beta ZIM file of the English selection will be released in the next few days (the problem is not the software, but the selection team, which is not as fast as you ;).
But I have a question: would it be possible to have a tutorial and/or a usage() message (displaying a minimal manual when no parameter is given) for the zimwriter? In particular, I am looking for the way to specify the welcome page.
Thank you again for this work.
Emmanuel