Hi,
I would like to inform you that I have reached a major milestone with the new zim format: I successfully created a zim file and read it with zimDump.
The changes are:
* rewritten large parts
* updated the zim file format
* redesigned zimwriter
Let me say a few words about these changes and why I made them.
* Rewritten large parts:
The rewrite helped me to improve code quality. With the knowledge I have today and my experience with the zeno file format, it was possible to clean up the library code.
* Updated the zim file format:
Since we decided to give up compatibility, I rethought some parts of the zeno file format. The zeno file format did not support clustering of articles to get better compression. I made a minor change and added an offset and size to the directory entry of the article. The offset to the data blob was left in the article, but now multiple articles could point to the same blob. In the new format I added another data structure: the chunk, which is a collection of blobs. There is a pointer list, similar to the directory pointer list, which points to the chunks. An article addresses its blob by chunk number and blob number. Redirect entries do not need these pointers at all, so I just skipped them, which saves some bytes for each redirect.
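To sketch the idea in code (the names here are only illustrative, not the exact on-disk layout):

    #include <stdint.h>

    // An article entry addresses its data indirectly via chunk and blob number.
    struct ArticleEntry
    {
        uint32_t chunkNumber;   // index into the chunk pointer list
        uint32_t blobNumber;    // blob inside the (compressed) chunk
        // ... namespace, title, url, mime type, etc.
    };

    // A redirect entry carries no chunk/blob pointers at all,
    // only the index of the directory entry it redirects to.
    struct RedirectEntry
    {
        uint32_t redirectIndex;
        // ... namespace, title, url, etc.
    };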
* Redesigned zimwriter:
The source of articles is now abstracted from the generator. Also, the database is no longer used for temporary data. The writer builds the directory entries in memory and uses a temporary file to collect the compressed data. This will improve performance significantly. The caveat is that more RAM is used, but I estimate that we have enough even for very large zim files.
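Roughly sketched (a minimal illustration of the idea; the names are not the actual zimwriter internals):

    #include <fstream>
    #include <string>
    #include <vector>

    struct DirEntry { std::string url; /* chunk/blob number, title, ... */ };

    void writeZimFile(const std::string& outPath)
    {
        std::vector<DirEntry> directory;                 // built completely in RAM
        std::ofstream tmpData("zim.tmp", std::ios::binary);

        // Phase 1: compress the articles chunk by chunk into tmpData
        //          and append one DirEntry per article to directory.

        // Phase 2: write the header and the in-memory directory to outPath,
        //          then copy tmpData behind it and fix up the offsets.
    }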
The abstraction of the data source makes it easier to implement other sources, e.g. reading data from the file system or from wikipedia dumps without using the database at all.
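For example, the generator could pull articles through an interface like this (again just a sketch of the concept, not the real API):

    class Article;  // url, title, data, redirect target, ...

    // Any source of articles implements this interface.
    class ArticleSource
    {
      public:
        virtual ~ArticleSource() { }
        // Returns the next article, or 0 when the source is exhausted.
        virtual const Article* getNextArticle() = 0;
    };

    // Conceivable implementations, all feeding the same generator:
    //   DatabaseSource   - reads prepared articles from the database
    //   FilesystemSource - walks a directory tree of HTML files
    //   XmlDumpSource    - parses a wikipedia XML dump directly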
I hope this will motivate you to go on dumping data, so that we can soon start testing.
There is still quite some work to do on my side. I need to get the zimreader working again, and the next big task is the full text index. My plan is to read the data directly from zim files and add the full text index to the zim files in a separate step, or optionally generate a separate zim file for the index, as was done for the German Wikipedia DVD.
Tommi
Hi Tommi and the rest of the team,
thanks for committing your changes and of course thank you very much for your work on the project. I am currently sitting at the Wikimedia Board Panel at the Wikimedia Conference (chapters meeting) in Berlin.
Yesterday I had an interesting discussion with Asaf Bartov. He is a volunteer at Wikimedia Israel and professionally he is a developer of transactional systems for banks. Wikimedia Israel has plans to get the Hebrew Wikipedia onto the OLPC and is looking for technical solutions. Asaf already worked on BzReader, which is a reader for the bzip2-compressed XML dumps the WMF provides. BzReader runs on ReactOS-compatibles, has to parse the MW syntax by itself and has a built-in browser. Asaf already fixed some bugs, but when he heard me mention openZIM in the introductory session, he started thinking about better investing his effort into something which already works. We looked into openZIM yesterday, checked out the stuff from Subversion, compiled it and played with it. I am pretty sure that Asaf will be quite happy with the update now and I will try to get him on board the development team.
This would then also require more frequent commits, so that Asaf is able to work on current code and submit patches.
He also had Kiwix on his hard drive and we updated it from Subversion, so we realised that Emmanuel has also done some work on his code: zim has replaced zeno now, etc.
There are two other Wikipedia DVD projects at this conference:
* Wikimedia Polska made a Wikipedia DVD based on the HTML dumps the WMF provides, supplemented with a search engine written in Java, so it runs as an applet in the browser on any platform. There will be no further releases as it didn't sell well in Poland.
* Wikimedia Italia has a DVD which will be released in a newer version this year, but I haven't been able to find out more about it yet.
The main point, I think, is to get the dumps working now. Can someone please try to put together the software needed and commit it to our Subversion?
I had a private discussion with Mike Godwin, the general counsel of the WMF, about the trademarks etc. As we publish the DVD with Wikimedia CH and it is a non-commercial project, I will need approval from Mike Godwin (every single use of the Wikipedia logo needs approval now), but he is willing to give it to me.
Greets,
Manuel
Manuel Schneider wrote:
Yesterday I had an interesting discussion with Asaf Bartov. He is a volunteer at Wikimedia Israel and professionally he is a developer of transactional systems for banks. Wikimedia Israel has plans to get the Hebrew Wikipedia onto the OLPC and is looking for technical solutions. Asaf already worked on BzReader, which is a reader for the bzip2-compressed XML dumps the WMF provides. BzReader runs on ReactOS-compatibles, has to parse the MW syntax by itself and has a built-in browser. Asaf already fixed some bugs, but when he heard me mention openZIM in the introductory session, he started thinking about better investing his effort into something which already works.
That would be pretty good. The OLPC bzip2-based storage system is not bad, but it is conceptually tied to bz2 and at least for that reason is not the best solution. We have to convince the OLPC teams that they have everything to gain by adopting ZIM. Thank you for doing that, Manuel.
We looked into openZIM yesterday, checked out the stuff from Subversion, compiled it and played with it. I am pretty sure that Asaf will be quite happy with the update now and I will try to get him on board the development team.
Super!
This would then also require more frequent commits, so that Asaf is able to work on current code and submit patches.
He also had Kiwix on his hard drive and we updated it from Subversion, so we realised that Emmanuel has also done some work on his code: zim has replaced zeno now, etc.
Yes, that's true. It works again now after Tommi's big rewrite.
There are two other Wikipedia DVD projects at this conference:
- Wikimedia Polska made a Wikipedia DVD based on the HTML dumps the WMF provides, supplemented with a search engine written in Java, so it runs as an applet in the browser on any platform. There will be no further releases as it didn't sell well in Poland.
Ok.
- Wikimedia Italia has a DVD which will be released in a newer version this year, but I haven't been able to find out more about it yet.
The main point, I think, is to get the dumps working now. Can someone please try to put together the software needed and commit it to our Subversion?
Yes, having different dumps also seems to me to be the priority.
I had a private discussion with Mike Godwin, the general counsel of the WMF, about the trademarks etc. As we publish the DVD with Wikimedia CH and it is a non-commercial project, I will need approval from Mike Godwin (every single use of the Wikipedia logo needs approval now), but he is willing to give it to me.
I'm always sceptical about the Foundation, especially concerning the offline content topic. My experience shows that they were always more of a brake than a help... but thank you very much for your lobbying: this is important and I can't do such things ;)
Regards
Emmanuel
On Monday, 13 April 2009, Emmanuel Engelhart wrote:
That would be pretty good. The OLPC bzip2-based storage system is not bad, but it is conceptually tied to bz2 and at least for that reason is not the best solution. We have to convince the OLPC teams that they have everything to gain by adopting ZIM. Thank you for doing that, Manuel.
We have to distinguish here between the "official" OLPC Wikipedia Offline efforts (which you can read about on wikireader@lists.laptop.org; I am subscribed there but haven't received any mails for months) and what Asaf is working on.
But anyway: BzReader has another major drawback: it parses the content (wikitext) in the reader, which doesn't really work for templates, parser functions etc. On the other hand, you can just use the XML dumps prepared by Wikimedia, which eliminates the need for a dumping process and tools. But then we have to remember that Wikimedia's dumping process is quite unstable and not very regular.
He also had Kiwix on his hard drive and we updated it from Subversion, so we realised that Emmanuel has also done some work on his code: zim has replaced zeno now, etc.
Yes, that's true. It works again now after Tommi's big rewrite.
I would like to have Kiwix on the LinuxTag DVD; is that OK? Will it be ready? As far as I know Kiwix also runs on Windows, right?
I had a private discussion with Mike Godwin, the general counsel of the WMF, about the trademarks etc. As we publish the DVD with Wikimedia CH and it is a non-commercial project, I will need approval from Mike Godwin (every single use of the Wikipedia logo needs approval now), but he is willing to give it to me.
I'm always sceptical about the Foundation, especially concerning the offline content topic. My experience shows that they were always more of a brake than a help... but thank you very much for your lobbying: this is important and I can't do such things ;)
I am also sceptical. We are an independent project, we don't rely on the WMF, so this shouldn't be an issue for us. On the other hand, since Erik contacted me out of his own interest, we might benefit from a partnership, whatever it will look like. If we have any ideas or requirements which could be supported by the Foundation, now is the time to raise them.
Cheers,
Manuel
Manuel Schneider wrote:
He also had Kiwix on his hard drive and we updated it from Subversion, so we realised that Emmanuel has also done some work on his code: zim has replaced zeno now, etc.
Yes, that's true. It works again now after Tommi's big rewrite.
I would like to have Kiwix on the LinuxTag DVD; is that OK? Will it be ready? As far as I know Kiwix also runs on Windows, right?
A few infos about Kiwix dev. status:
* 5 months ago, Linterweb left the project and I'm now more or less alone on it, although with some substantial support, in particular from Moulinwiki.
* Linterweb did not release (as free software) a part of the code of Kiwix 0.7, the last version.
* For this reason, Kiwix 0.7 is no longer supported by me and I am working on Kiwix 0.8.
* Although Kiwix 0.5 & 0.7 ran under Linux, Windows and Mac, Kiwix 0.8 will only run on Linux: the main cause is that I do not know how to get libzim working on Windows.
* Getting Kiwix working on Windows is my next priority after generating a few essential ZIM files (second half of the year).
* Kiwix 0.8 is shortly before beta; it will be above all a ZIM file reader (allowing to work easily with multiple files at the same time), and I will continue to invest time in producing content using this format and working on scripts to do that.
* The timeline for 0.8 should be LinuxTag-compatible; please give me a deadline for the DVD.
Regards
Emmanuel
Hi Tommi,
Tommi Mäkitalo wrote:
I would like to inform you that I have reached a major milestone with the new zim format: I successfully created a zim file and read it with zimDump.
The changes are:
- rewritten large parts
- updated the zim file format
- redesigned zimwriter
This is very good news and I thank you for having rewritten the zimwriter in time.
Let me say a few words about these changes and why I made them.
- Rewritten large parts:
The rewrite helped me to improve code quality. With the knowledge I have today and my experience with the zeno file format, it was possible to clean up the library code.
- Updated the zim file format:
Since we decided to give up compatibility, I rethought some parts of the zeno file format. The zeno file format did not support clustering of articles to get better compression. I made a minor change and added an offset and size to the directory entry of the article. The offset to the data blob was left in the article, but now multiple articles could point to the same blob. In the new format I added another data structure: the chunk, which is a collection of blobs. There is a pointer list, similar to the directory pointer list, which points to the chunks. An article addresses its blob by chunk number and blob number. Redirect entries do not need these pointers at all, so I just skipped them, which saves some bytes for each redirect.
OK, I really need to read your doc on the wiki to better understand your explanation ;)
- Redesigned zimwriter:
The source of articles is now abstracted from the generator. Also, the database is no longer used for temporary data. The writer builds the directory entries in memory and uses a temporary file to collect the compressed data. This will improve performance significantly. The caveat is that more RAM is used, but I estimate that we have enough even for very large zim files.
I have tested it; my first impression is that the zimwriter is really faster than before.
The abstraction of the data source makes it easier to implement other sources, e.g. reading data from the file system or from wikipedia dumps without using the database at all.
Great, I currently do that (from the file system) by running a perl script that creates a DB. Maybe I now have to invest time to do it directly in C++. In any case, this seems to me to be a really good architectural improvement.
You did not mention the most essential info for me! The zimwriter no longer seems to die on big dumps (at least for me).
I hope this will motivate you to go on dumping data, so that we can soon start testing.
I will produce big selection ZIM files (30,000 to 50,000 articles, with small pictures) in English, French and Spanish by LinuxTag. A new beta ZIM file of the English selection will be released in the next few days (the problem is not the software, but the selection team, which is not as fast as you ;).
But I have a question: is it possible to have a tutorial and/or a usage() (displaying a minimal manual in case no parameter is given) for the zimwriter? I am especially looking for the way to specify the welcome page.
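For illustration, something as small as this would already help (the option names below are pure guesses on my side, not zimwriter's actual flags):

    #include <cstdlib>
    #include <iostream>

    // Print a minimal manual and exit when no parameter is given.
    void usage(const char* progname)
    {
        std::cerr << "usage: " << progname << " [options] <source>\n"
                  << "  -w <url>    set the welcome page\n"
                  << "  -o <file>   write the zim file to <file>\n";
        std::exit(1);
    }

    int main(int argc, char* argv[])
    {
        if (argc < 2)
            usage(argv[0]);
        // ... normal processing ...
        return 0;
    }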
Thank you again for this work.
Emmanuel