Hi,
I would like to inform you that I have reached a major milestone with the new zim format: I have successfully created a zim file and read it back with zimDump.
The changes are:
* rewritten large parts of the library
* updated the zim file format
* redesigned zimwriter
Let me say some words about these changes and why I did this.
* Rewritten large parts:
Rewriting helped me to improve code quality. With the knowledge I have today and my experience with the zeno file format, I was able to clean up the library code.
* Updated the zim file format:
Since we decided to give up compatibility, I rethought some parts of the zeno file format. The zeno file format did not support clustering of articles for better compression. As a minor change I had added an offset and a size to the directory entry of the article; the offset to the data blob was kept in the article, but multiple articles could point to the same blob. In the new format I added another data structure: the chunk, which is a collection of blobs. There is a pointer list, similar to the directory pointer list, which points to the chunks. An article addresses its blob by chunk number and blob number. Redirect entries do not need these pointers at all, so I simply omit them; this saves a few bytes per redirect.
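To illustrate the addressing scheme, here is a minimal sketch in Python; the names and the in-memory layout are my own invention for illustration, not the actual zim on-disk format:

```python
# Hypothetical model of the chunk/blob addressing described above.
# A chunk is a collection of blobs; an article addresses its data
# by chunk number and blob number, while a redirect entry stores
# only the index of its target directory entry.

chunks = [
    [b"<html>article 0</html>", b"<html>article 1</html>"],  # chunk 0
    [b"<html>article 2</html>"],                             # chunk 1
]

directory = [
    {"title": "A", "chunk": 0, "blob": 0},
    {"title": "B", "chunk": 0, "blob": 1},
    {"title": "C", "chunk": 1, "blob": 0},
    {"title": "A (alias)", "redirect": 0},  # no chunk/blob pointers at all
]

def resolve(entry_index):
    """Follow redirects, then look up the blob via chunk and blob number."""
    entry = directory[entry_index]
    while "redirect" in entry:
        entry = directory[entry["redirect"]]
    return chunks[entry["chunk"]][entry["blob"]]

print(resolve(3))  # the alias resolves to article 0's blob
```

Because a redirect is just an index into the directory, it costs no chunk/blob pointer fields at all, which is where the per-redirect savings come from.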
* Redesigned zimwriter:
Now the source of articles is abstracted from the generator, and the database is no longer used for temporary data. The writer builds the directory entries in memory and uses a temporary file to collect the compressed data. This should improve performance significantly. The caveat is that more RAM is used, but I estimate that we have enough even for very large zim files.
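As a rough sketch of that writer flow (Python, with zlib standing in for the actual compressor; the structure, names, and chunk size are my assumptions):

```python
import tempfile
import zlib

# Hypothetical sketch: directory entries are kept in memory while the
# compressed chunks are appended to a temporary file.

def write_chunks(articles, blobs_per_chunk=2):
    dirents = []                        # directory entries stay in RAM
    tmp = tempfile.TemporaryFile()      # compressed data goes to disk
    bytes_written = 0
    for start in range(0, len(articles), blobs_per_chunk):
        chunk = articles[start:start + blobs_per_chunk]
        chunk_no = start // blobs_per_chunk
        for blob_no, (title, _) in enumerate(chunk):
            dirents.append((title, chunk_no, blob_no))
        compressed = zlib.compress(b"".join(data for _, data in chunk))
        tmp.write(compressed)
        bytes_written += len(compressed)
    return dirents, tmp, bytes_written

dirents, tmp, total = write_chunks([("A", b"aaa"), ("B", b"bbb"), ("C", b"ccc")])
```

Only the small directory entries accumulate in memory; the bulk of the data is spilled to the temporary file as soon as a chunk is compressed, which is why the RAM cost stays bounded by the directory size rather than the content size.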
The abstraction of the data source makes it easier to implement other sources, e.g. reading data from the file system or from wikipedia dumps without using the database at all.
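The source abstraction could look roughly like this (a Python sketch; the real zimwriter is C++ and its actual interface will differ):

```python
from abc import ABC, abstractmethod

# Hypothetical article-source interface: the generator only consumes
# (title, data) pairs and does not care where they come from.

class ArticleSource(ABC):
    @abstractmethod
    def articles(self):
        """Yield (title, data) pairs."""

class FilesystemSource(ArticleSource):
    """Example source reading from an in-memory dict standing in for
    the file system -- no database involved."""
    def __init__(self, tree):
        self.tree = tree
    def articles(self):
        for path, data in sorted(self.tree.items()):
            yield path, data

source = FilesystemSource({"A/Home.html": b"<html>home</html>"})
titles = [title for title, _ in source.articles()]
```

A wikipedia-dump source would be another subclass with the same `articles()` method, so the generator code stays unchanged.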
I hope this motivates you to go on dumping data, so that we can start testing soon.
There is still quite some work left for me. I need to get the zimreader working again, and the next big task is the full text index. My plan is to read the data directly from zim files and add the full text index to the zim files in a separate step, or optionally generate a separate zim file for the index, as was done for the German Wikipedia DVD.
Tommi