Hi Tommi,
Tommi Mäkitalo wrote:
> I would like to inform you that I have reached a major milestone with the new zim format: I have successfully created a zim file and read it with zimDump.
> The changes are:
> - rewritten large parts
> - updated the zim file format
> - redesigned zimwriter
This is very good news, and I thank you for having rewritten the zimwriter in time.
> Let me say a few words about these changes and why I did this.
> - Rewritten large parts:
> Rewriting helped me improve the code quality. With the knowledge I have today and my experience with the zeno file format, it was possible to clean up the library code.
> - Updated the zim file format:
> Since we decided to drop compatibility, I rethought some parts of the zeno file format. The zeno file format did not support clustering of articles to get better compression. I made a minor change and added an offset and size to the directory entry of the article. The offset to the data blob was left in the article entry, but now multiple articles pointed to the same blob. In the new format I added another data structure: the chunk, which is a collection of blobs. We have a pointer list, similar to the directory pointer list, which points to the chunks. An article addresses its blob by chunk number and blob number. Also, redirect entries do not need these pointers at all, so I just skipped them. This saves some bytes for each redirect.
OK, I really need to read your doc on the wiki to better understand your explanation ;)
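If I got it right, the new layout is roughly something like this (a quick sketch of my understanding only; the names are mine, not the real zimlib structures):

    #include <stdint.h>
    #include <vector>

    // Sketch of my understanding of the new format (field names are guesses).
    struct ArticleEntry {
        uint32_t chunkNumber;    // which chunk holds the article's data
        uint32_t blobNumber;     // which blob inside that chunk
        // ... plus namespace, title, url, etc.
    };

    struct RedirectEntry {
        uint32_t redirectIndex;  // index of the target article
        // no chunk/blob fields at all -> a few bytes saved per redirect
    };

    // A chunk is a collection of blobs; a pointer list, similar to the
    // directory pointer list, gives the position of each chunk in the file.
    struct Chunk {
        std::vector< std::vector<char> > blobs;
    };

Is that about right?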
> - Redesigned zimwriter:
> Now the source of articles is abstracted from the generator. Also, the database is not used any more for temporary data. The writer builds the directory entries in memory and uses a temporary file to collect the compressed data. This will improve performance significantly. The caveat is that more RAM is used, but I estimate that we have enough even for very large zim files.
I have tested it; my first impression is that the zimwriter is really faster than before.
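If I follow the new design correctly, the writer now does roughly this (just my mental model of what you describe, not the actual zimwriter code):

    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <vector>

    // Mental model only: directory entries stay in RAM, compressed chunk
    // data is collected in a temporary file, and the final zim file is
    // assembled from both at the end.
    struct DirEntry { std::string url; unsigned chunk; unsigned blob; };

    int main()
    {
        std::vector<DirEntry> directory;                    // kept in memory
        std::ofstream tmp("chunks.tmp", std::ios::binary);  // temp data file

        // for every article: remember where its data will live ...
        DirEntry e; e.url = "Article"; e.chunk = 0; e.blob = 0;
        directory.push_back(e);
        // ... and append its (compressed) chunk to the temporary file
        tmp << "compressed-chunk-bytes";                    // placeholder
        tmp.close();

        // final pass: write the header and the in-memory directory, append
        // the chunks from the temporary file, then delete it.
        std::remove("chunks.tmp");
        return 0;
    }

That would explain both the speed-up (no more database round trips for temporary data) and the extra RAM.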
> The abstraction of the data source gives us the opportunity to implement other sources more easily, e.g. reading data from the file system or from wikipedia dumps without using the database at all.
Great. I currently do that (from the file system) by running a Perl script that creates a DB, so I may now have to invest time to do it directly in C++. In any case this seems to me to be a really good architectural improvement.
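Something like a minimal article-source interface is what I have in mind; a rough sketch (hypothetical names, nothing from the real zimwriter):

    #include <string>

    // Hypothetical sketch of such an abstraction: the generator would only
    // talk to this interface, so a source can be a database, the file
    // system, a wiki dump, ...
    class ArticleSource {
    public:
        virtual ~ArticleSource() {}
        virtual bool next() = 0;               // advance to the next article
        virtual std::string url() const = 0;
        virtual std::string data() const = 0;  // the article content
    };

    // A source that would walk a directory tree instead of querying a DB.
    class FilesystemSource : public ArticleSource {
        std::string root;
    public:
        explicit FilesystemSource(const std::string& rootDir) : root(rootDir) {}
        bool next()              { return false; }  // real code would iterate files under root
        std::string url() const  { return ""; }     // e.g. the path relative to root
        std::string data() const { return ""; }     // e.g. the file's content
    };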
You did not mention the most essential piece of info for me! The zimwriter no longer seems to die on big dumps (at least for me).
> I hope this will motivate you to go on dumping data, so that we can soon start testing.
I will produce big selection ZIM files (30,000 to 50,000 articles with small pictures) in English, French and Spanish by the Linuxtag. A new beta ZIM file of the English selection will be released in the next few days (the problem is not the software, but the selection team, which is not as fast as you ;).
But I have a question: would it be possible to have a tutorial and/or a usage() message (displaying a minimal manual when no parameter is given) for the zimwriter? In particular, I am looking for the way to specify the welcome page.
Thank you again for this work.
Emmanuel