Hi,
I would like to inform you that I have reached a major milestone with the new zim format: I have successfully created a zim file and read it back with zimDump.
The changes are:
* rewritten large parts of the library
* updated the zim file format
* redesigned zimwriter
Let me say some words about these changes and why I did this.
* Rewritten large parts:
Rewriting helped me to improve code quality. With the knowledge I have today and my experience with the zeno file format, I was able to clean up the library code.
* Updated the zim file format:
Since we decided to give up compatibility, I rethought some parts of the zeno file format. The zeno file format did not support clustering of articles for better compression. As a minor change I had added an offset and a size to the directory entry of the article; the offset to the data blob was kept in the article, but multiple articles could point to the same blob. In the new format I added another data structure: the chunk, which is a collection of blobs. There is a pointer list, similar to the directory pointer list, which points to the chunks. An article addresses its blob by chunk number and blob number. Redirect entries do not need these pointers at all, so I simply omit them; this saves a few bytes per redirect.
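To illustrate the addressing scheme, here is a minimal sketch in Python; the names and the in-memory layout are my own invention for illustration, not the actual zim on-disk format:

```python
# Hypothetical model of the chunk/blob addressing described above.
# A chunk is a collection of blobs; an article addresses its data
# by chunk number and blob number, while a redirect entry stores
# only the index of its target directory entry.

chunks = [
    [b"<html>article 0</html>", b"<html>article 1</html>"],  # chunk 0
    [b"<html>article 2</html>"],                             # chunk 1
]

directory = [
    {"title": "A", "chunk": 0, "blob": 0},
    {"title": "B", "chunk": 0, "blob": 1},
    {"title": "C", "chunk": 1, "blob": 0},
    {"title": "A (alias)", "redirect": 0},  # no chunk/blob pointers at all
]

def resolve(entry_index):
    """Follow redirects, then look up the blob via chunk and blob number."""
    entry = directory[entry_index]
    while "redirect" in entry:
        entry = directory[entry["redirect"]]
    return chunks[entry["chunk"]][entry["blob"]]

print(resolve(3))  # the alias resolves to article 0's blob
```

Because a redirect is just an index into the directory, it costs no chunk/blob pointer fields at all, which is where the per-redirect savings come from.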
* Redesigned zimwriter:
Now the source of articles is abstracted from the generator, and the database is no longer used for temporary data. The writer builds the directory entries in memory and uses a temporary file to collect the compressed data. This should improve performance significantly. The caveat is that more RAM is used, but I estimate that we have enough even for very large zim files.
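As a rough sketch of that writer flow (Python, with zlib standing in for the actual compressor; the structure, names, and chunk size are my assumptions):

```python
import tempfile
import zlib

# Hypothetical sketch: directory entries are kept in memory while the
# compressed chunks are appended to a temporary file.

def write_chunks(articles, blobs_per_chunk=2):
    dirents = []                        # directory entries stay in RAM
    tmp = tempfile.TemporaryFile()      # compressed data goes to disk
    bytes_written = 0
    for start in range(0, len(articles), blobs_per_chunk):
        chunk = articles[start:start + blobs_per_chunk]
        chunk_no = start // blobs_per_chunk
        for blob_no, (title, _) in enumerate(chunk):
            dirents.append((title, chunk_no, blob_no))
        compressed = zlib.compress(b"".join(data for _, data in chunk))
        tmp.write(compressed)
        bytes_written += len(compressed)
    return dirents, tmp, bytes_written

dirents, tmp, total = write_chunks([("A", b"aaa"), ("B", b"bbb"), ("C", b"ccc")])
```

Only the small directory entries accumulate in memory; the bulk of the data is spilled to the temporary file as soon as a chunk is compressed, which is why the RAM cost stays bounded by the directory size rather than the content size.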
The abstraction of the data source makes it easier to implement other sources, e.g. reading data from the file system or from wikipedia dumps without using the database at all.
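The source abstraction could look roughly like this (a Python sketch; the real zimwriter is C++ and its actual interface will differ):

```python
from abc import ABC, abstractmethod

# Hypothetical article-source interface: the generator only consumes
# (title, data) pairs and does not care where they come from.

class ArticleSource(ABC):
    @abstractmethod
    def articles(self):
        """Yield (title, data) pairs."""

class FilesystemSource(ArticleSource):
    """Example source reading from an in-memory dict standing in for
    the file system -- no database involved."""
    def __init__(self, tree):
        self.tree = tree
    def articles(self):
        for path, data in sorted(self.tree.items()):
            yield path, data

source = FilesystemSource({"A/Home.html": b"<html>home</html>"})
titles = [title for title, _ in source.articles()]
```

A wikipedia-dump source would be another subclass with the same `articles()` method, so the generator code stays unchanged.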
I hope this motivates you to go on dumping data, so that we can start testing soon.
There is still quite some work left for me. I need to get the zimreader working again, and the next big task is the full text index. My plan is to read the data directly from zim files and add the full text index to the zim files in a separate step, or optionally generate a separate zim file for the index, as was done for the German Wikipedia DVD.
Tommi