Hi,
I've some ideas concerning libzim, which I would like to share with you.
There is a class zim::Files, which represents a set of zim-files in one
directory. I would like to drop that class, if it is not used in Kiwix.
(Emmanuel: Can you verify that?)
Let me explain why the class is there, why I would like to remove it, and how
to replace its functionality.
In the original German Wikipedia DVD, the content of Wikipedia was in one zeno
file and the word index, which is also a zeno file, was in another. Images were
included only in very poor quality due to the limited capacity of the DVD.
There was a compact and a premium edition. The premium edition had images in
much better quality on 3 additional DVDs. Each had a zeno file with part of the
images. To use these better-quality images, the user had to copy all zeno files
into one directory on their hard drive and configure the reader to use that
directory. When an image was requested, the reader searched for it in all zeno
files and fetched the copy with the best quality.
I feel that this solution is not optimal. There is also a technical reason why
it is not that good, and since I have a better idea (at least I think the idea
is better ;-) ), I would like to drop that feature. The technical reason is
simple: there is no common API to read directories. Almost all operating
systems have opendir/readdir/closedir. Unfortunately ReactOS (and other
systems that use win32) do not have these functions, so zimlib has to use a
different API on these systems. This is not a big problem, but it is still a
problem. It is actually the only OS-specific code in zimlib (or rather in
cxxtools, which zimlib uses and which provides a wrapper for these functions).
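To illustrate the portability issue, here is a minimal sketch of a directory
listing wrapper. It is not the cxxtools code; the function name listDirectory
is hypothetical. The POSIX branch uses opendir/readdir/closedir, and the win32
branch assumes the FindFirstFileA/FindNextFileA/FindClose calls are used
instead.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

#ifdef _WIN32
#  include <windows.h>
#else
#  include <dirent.h>
#endif

// Hypothetical helper (not part of zimlib or cxxtools): returns the names of
// all entries in `path`, skipping "." and "..".
std::vector<std::string> listDirectory(const std::string& path)
{
    std::vector<std::string> names;
#ifdef _WIN32
    // win32 has no opendir; a search handle with a wildcard pattern is used.
    WIN32_FIND_DATAA entry;
    HANDLE handle = FindFirstFileA((path + "\\*").c_str(), &entry);
    if (handle == INVALID_HANDLE_VALUE)
        throw std::runtime_error("cannot open directory " + path);
    do
    {
        std::string name = entry.cFileName;
        if (name != "." && name != "..")
            names.push_back(name);
    } while (FindNextFileA(handle, &entry));
    FindClose(handle);
#else
    DIR* dir = opendir(path.c_str());
    if (dir == nullptr)
        throw std::runtime_error("cannot open directory " + path);
    while (dirent* entry = readdir(dir))
    {
        std::string name = entry->d_name;
        if (name != "." && name != "..")
            names.push_back(name);
    }
    closedir(dir);
#endif
    return names;
}
```

The two branches are the only place where the platforms differ; callers see
the same function on both.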
This functionality is not needed on the Wikipedia DVD planned for LinuxTag,
since we won't have a premium edition but will have all data in one single zim
file.
In the future I plan to create a utility that merges the content of 2 or more
zim files into one. The structure of zim files makes this quite an easy
operation, much easier than creating new zim files. The changes I made
compared to zeno in particular make it very easy and fast. Combining 2 zim
files is almost as cheap as copying them. So instead of copying all zim files
into one directory, users can just combine multiple zim files to create one
single big file.
The utility will have an option to control how duplicate articles are handled.
It may prefer the articles of one of the files, so it will be possible to make
update files that provide only the changed articles. The resulting combined
file will not be exactly the same as a newly created file, since it will have
empty blob entries for removed article data. But the user won't see any
difference.
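The duplicate handling could look roughly like the following sketch. It works
on an in-memory map instead of real zim files, and all names (ArticleMap,
DuplicatePolicy, mergeArticles) are hypothetical; the planned utility does not
exist yet and may work differently.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for a zim file: article URL -> article data.
using ArticleMap = std::map<std::string, std::string>;

// Which file wins when both contain an article with the same URL.
enum class DuplicatePolicy { PreferFirst, PreferSecond };

// Merges two article sets. With PreferSecond the second set acts as an
// update file whose articles replace those of the first set.
ArticleMap mergeArticles(const ArticleMap& first,
                         const ArticleMap& second,
                         DuplicatePolicy policy)
{
    ArticleMap merged = first;
    for (const auto& entry : second)
    {
        // insert() leaves an existing entry untouched and reports that
        // through the bool in the returned pair.
        auto inserted = merged.insert(entry);
        if (!inserted.second && policy == DuplicatePolicy::PreferSecond)
            inserted.first->second = entry.second;
    }
    return merged;
}
```

With PreferSecond, a small file containing only changed articles can be merged
over a large base file to produce an updated file.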
Tommi
Great!
I am also curious to hear from Kiwix. Will it be ready so that we can put both
readers onto the DVD?
Greetings,
Manuel
On Saturday, 11 April 2009, Tommi Mäkitalo wrote:
> Hi,
>
> the ZimReader is fixed and I can browse through the zim file I created from
> Josch's dewiki-Dump. There were some smaller bugs in the reader as well as
> in the library, but the main bug was in cxxtools. So if you want to test
> the reader please update cxxtools.
>
> There are still some tasks to do in the reader. There is a lot of hardcoded
> stuff from the old Wikipedia DVD, like the text "DVD-ROM-Ausgabe 2007" in
> the title area and a reference to Directmedia.
>
> I also updated the status and next steps page in our wiki.
>
> Tommi
--
Regards
Manuel Schneider
Wikimedia CH - Verein zur Förderung Freien Wissens
Wikimedia CH - Association for the advancement of free knowledge
www.wikimedia.ch
Hi,
I created an openZIM file from the German Wikipedia dump I got from Josch
last year. The file contains all articles from the German Wikipedia without
images and its size is 1.3 GB (or more precisely 1302052315 bytes). Generation
took only about 1:10 on our server with the new zimwriter. It could be improved
even further by parallelizing the compression phase, since this is CPU-bound
and takes the most time, but I feel that it is not necessary. There are more
important tasks to do.
You can download the file from http://www.openzim.org/download/dewiki.zim.
The zimreader (the tntnet-based web application) is almost working with that
file. There are some bugs to fix, but this will be done soon.
Emmanuel: the file is updated. I fixed some bugs. zimDump crashed when
reading redirects and the writer failed to generate redirects correctly.
Josch: do you have an updated dump?
Tommi
Hi,
I have released three "small" demo ZIM files:
* wikipedia_en_wp1_0.5_2000+_03_2006_rc1.zim, a ZIM version of the WP1
0.5 selection done 2 years ago.
* wikipedia_es_fa+ga_2000+_04_2008_alpha1.zim, with all featured and
good articles from the Wikipedia in Spanish.
* wikipedia_it_fa_1000+_-4_2008_alpha1.zim, with all featured articles
from the Wikipedia in Italian.
You can download them at http://tmp.kiwix.org/zim/
A few remarks:
* These files are alpha versions and a few things are still wrong with them:
not GFDL compliant, etc.
* You can read them with the tntreader or the latest alpha version of Kiwix
(http://tmp.kiwix.org/bin/kiwix-0.8-alpha2+xulrunner.tar.bz2).
* We do not have a reader for Windows at the moment.
Regards
Emmanuel
Dear all,
you may have noticed a short outage tonight.
The server lost its physical network link several times. We have now replaced
the patch cable. I hope the issue is now resolved.
Then we had the conference call with Erik Möller, Tomasz Finc and Brion Vibber
from the Wikimedia Foundation tonight.
The WMF is interested in openZIM as a standard storage format for offline
content and offers support, e.g. funding DVDs or similar.
The WMF has some plans to standardize the distribution of content in several
ways. The ZIM format may play a major role there, but this will have to
evolve over the next one or two years. Then we will see how the plans of the
WMF work out, what state ZIM is in at that time and what features it has by
then.
I will add the outcome later to the preparation page under
http://openzim.org/Wikimedia_Foundation_Relationship
I am happy to welcome Brion Vibber, Chief Technical Officer of the Wikimedia
Foundation, on our list.
Some unrelated news:
For our Windows instance I added RDP support as well. Feel free to use that
instead of VNC (which doesn't work so seamlessly). It works the same way -
just install "rdesktop" and tunnel port 3389 through ssh.
$ ssh -L 3389:localhost:3389 USERNAME@openzim.org
$ rdesktop localhost
Thanks for your support,
Manuel
Hi,
I just checked in the suggested template engine, which is used in the layout
page. To use it, you need to define a layout page, which contains a tag
<%content%>. The class zim::Article has in addition to getData a method
getPage, which returns the article embedded into that layout page.
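As a rough illustration of what embedding an article into the layout page
means, here is a minimal sketch. It is not the zimlib implementation; the
function name embedInLayout is hypothetical, and returning the bare article
data when the layout has no <%content%> tag is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the embedding step behind getPage.
// Replaces the first <%content%> tag in `layout` with `articleData`.
std::string embedInLayout(const std::string& layout,
                          const std::string& articleData)
{
    static const std::string tag = "<%content%>";
    std::string::size_type pos = layout.find(tag);
    if (pos == std::string::npos)
        return articleData;  // assumption: no tag means no layout is applied
    std::string page = layout;
    page.replace(pos, tag.size(), articleData);
    return page;
}
```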
I added an option -p to zimDump to test that feature.
"zimDump -o 12 -d myfile.zim" prints the data from article number 12 and
"zimDump -o 12 -p myfile.zim" prints the data embedded into the layout.
Tommi
Hi,
It looks like I have to clarify some things about the Windows compatibility of
zimlib.
Zimlib is developed on Linux but not for Linux. It is as far as possible
platform independent. To compile zimlib you first need to compile cxxtools.
Both use autoconf and automake as the build system, which is platform
independent. The only prerequisites you need are a (Bourne-compatible) shell
and some simple tools like awk, sed and make. Cxxtools requires a C++
compiler. Zimlib depends on libz and libbz2. All of these are easily
installable with the package manager of your operating system.
As you know, Windows does not fulfil these simple requirements. It does not
have a shell, nor sed or awk. It does not even have a C++ compiler or a
package manager. So it is not as easy as just "emerge libz" or "apt-get
install zlib-dev" or whatever. This is the main problem in porting zimlib to
Windows. You have to create your build setup on your own and manually install
zlib, libbz2 and a C++ compiler. All of this is freely available.
OK, there are at least 2 things which won't work out of the box on Windows:
libiconv, which is needed by cxxtools and used in zimreader, and
opendir/readdir/closedir.
The libiconv stuff is not needed for zimlib. It can be excluded from the
build of cxxtools if you write your own build system.
Opendir/readdir/closedir is abstracted away by cxxtools. We just need a
Windows implementation.
Fortunately this opendir/readdir/closedir stuff is adopted from pt-framework,
which has such a Windows implementation. It is just a matter of adopting it
into cxxtools.
Mutexes were used in libzeno but are no longer used in zimlib.
There may be some other problems which need to be resolved, but I don't think
there are real showstoppers. The work just needs to be done. Nothing really
difficult.
The patches I got from Guillaume mainly copied the pt-framework directory stuff
into cxxtools. The other thing was to remove the mutex stuff from libzeno. This
was really not much work. The main problem is setting up the build environment.
This was not done. A Makefile for Windows would be helpful, but I haven't got
one from Guillaume. A document "how to set up my Windows environment to
compile cxxtools and libzeno" would help as well.
Tommi
Hi,
I would like to inform you that I have reached a major milestone with the new
zim format: I successfully created a zim file and read it with zimDump.
The changes are:
* rewritten large parts
* updated the zim file format
* redesigned zimwriter
Let me say some words about these changes and why I did this.
* Rewritten large parts:
The rewrite helped me to improve code quality. With what I know today and my
experience with the zeno file format, it was possible to clean up the library
code.
* Updated the zim file format:
Since we decided to give up compatibility, I rethought some parts of the zeno
file format. The zeno file format did not support clustering of articles to get
better compression. I had made a minor change and added an offset and size to
the directory entry of the article. The offset to the data blob was left in the
article, but now multiple articles pointed to the same blob. In the new format
I added another data structure: the chunk, which is a collection of blobs. We
have a pointer list, similar to the directory pointer list, which points to the
chunks. The article addresses its blob by chunk number and blob number. Also,
redirect entries do not need these pointers at all, so I left them out. This
saves some bytes for each redirect.
* Redesigned zimwriter:
Now the source of articles is abstracted from the generator. Also, the database
is no longer used for temporary data. The writer builds the directory
entries in memory and uses a temporary file to collect the compressed data.
This will improve performance significantly. The caveat is that more RAM is
used, but I estimate that we have enough even for very large zim files.
The abstraction of the data source gives us the opportunity to implement other
sources more easily, e.g. reading data from the file system or from Wikipedia
dumps without using the database at all.
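The chunk and blob addressing described above can be pictured with the
following in-memory sketch. It does not reflect the on-disk layout; the struct
and field names are hypothetical, and representing a redirect as an index into
the directory is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// A chunk is a collection of blobs (in the file they are compressed together).
struct Chunk
{
    std::vector<std::string> blobs;
};

// Hypothetical directory entry. On disk a redirect entry would simply omit
// the chunk and blob numbers; here both kinds share one struct for brevity.
struct DirEntry
{
    std::string title;
    bool isRedirect;
    std::uint32_t chunkNumber;    // used only when !isRedirect
    std::uint32_t blobNumber;     // used only when !isRedirect
    std::uint32_t redirectIndex;  // used only when isRedirect (assumption)
};

// Looks up the data of the article at `index`, following redirects.
const std::string& articleData(const std::vector<DirEntry>& directory,
                               const std::vector<Chunk>& chunks,
                               std::size_t index)
{
    const DirEntry* entry = &directory.at(index);
    std::size_t hops = 0;
    while (entry->isRedirect)
    {
        // A chain longer than the directory must contain a cycle.
        if (++hops > directory.size())
            throw std::runtime_error("redirect loop");
        entry = &directory.at(entry->redirectIndex);
    }
    return chunks.at(entry->chunkNumber).blobs.at(entry->blobNumber);
}
```

Because several articles share one chunk, they can be compressed together,
while each article still has its own blob within that chunk.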
I hope this will motivate you to go on dumping data, so that we soon can start
testing.
There is still quite some work for me to do. I need to get the zimreader
working again. The next big task is the full text index. My plan is to
read the data from zim files directly and add the full text index to the zim
files in a separate step, or optionally generate a separate zim file for the
index as was done with the German Wikipedia DVD.
Tommi
Hi,
I'm thinking about the layout of pages in zim files. I have some ideas about
what to do and I would like to share these with you. In particular I would like
to hear your expectations about the content of zim files.
Let me explain the problem. In the German Wikipedia DVD the layout of the page
was partly hardcoded into the reader. The HTML frame, CSS, images and
JavaScript files were compiled into the application. These files were
accessible through a special namespace '-'. E.g. "/-/monobookde.css" loads the
CSS file from the application.
It is easy to move these into the zim file.
The HTML frame is more difficult, since it contains dynamic parts like
"<title>...</title>" or the actual article text. So to move this page into the
zim file, we need some placeholders, which are parsed at runtime.
My plan is to introduce a special syntax for these placeholders. A tag of the
form "<%something%>" is replaced. What this "something" can be needs to be
defined. This syntax may be used in the layout page, which is already in the
zim file header, as well as in arbitrary pages. We might also add a special
mime type in addition to zimMimeTextHtml, for which zimlib parses these tags.
This "something" may be:
  <%title%>       title of the page
  <%url%>         the URL of the page (e.g. /A/Linux)
  <%namespace%>   the namespace
  <%/A/Linux%>    insert another article here
  <%content%>     placeholder for the article content in the layout page
  ...             (maybe more in the future)
In zimlib we have a class "zim::Article". It has a method "getData()",
which returns the article data of the page. I would add a new method
"getPage()", which uses the layout page to return the complete page.
This layout page should only be used when the mime type is zimMimeTextHtml.
This way the creator of the zim file can specify how to show pages without
repeating the HTML header and footer on each page. A reader may ignore this
layout if desired.
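The placeholder syntax proposed above could be parsed along the lines of the
following sketch. It is only an illustration; the function expandPlaceholders
and the resolver callback are hypothetical and not part of zimlib. The caller
decides what each name (title, url, namespace, content, or an article URL such
as /A/Linux) expands to. Copying an unterminated tag verbatim is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Replaces every <%name%> in `text` with resolve(name).
// Text outside of tags is copied unchanged.
std::string expandPlaceholders(
    const std::string& text,
    const std::function<std::string(const std::string&)>& resolve)
{
    std::string result;
    std::string::size_type pos = 0;
    while (true)
    {
        std::string::size_type open = text.find("<%", pos);
        if (open == std::string::npos)
            break;
        std::string::size_type close = text.find("%>", open + 2);
        if (close == std::string::npos)
            break;  // assumption: an unterminated tag is copied verbatim
        result.append(text, pos, open - pos);
        result += resolve(text.substr(open + 2, close - open - 2));
        pos = close + 2;
    }
    result.append(text, pos, std::string::npos);
    return result;
}
```

Keeping the lookup in a callback means new placeholder names can be added
later without touching the parser.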
Tommi
Dear openZIM developers team!
As I wrote last week, I was at the Wikimedia Conference in Berlin.
== Wikimedia Israel / Hebrew Wikipedia on OLPC ==
One result was that I met Asaf Bartov from Israel, who is working on the Hebrew
Wikipedia for the One Laptop Per Child project (OLPC, XO). I haven't heard
from him since I came back, but I am pretty sure we will hear from him after
the Easter holidays. I keep him up to date by forwarding the most relevant
mails from this mailing list.
== Wikimedia Italia / Italian Wikipedia on DVD ==
I also talked to Frieda Brioschi from Wikimedia Italia concerning their
Wikipedia DVD. They have a company in Italy which created proprietary
software with a GUI for the DVD and an "unknown" storage format for the data
(reminds me of the Directmedia approach in the first place). She said the
software was "awkward". They plan to have a new DVD this year and the company
promised to write new, better software. Frieda stated that WMIT is not yet
sure if they will sign a contract with this company again.
After we talked about openZIM, Frieda said that she will take it into account.
I suggested that WMIT could make their DVD on their own using our software
and maybe Kiwix, or point the company to our software so that it can use it
and integrate zimlib, or ask Emmanuel, who might be willing to make a DVD for
WMIT.
== Wikimedia Polska / Polish Wikipedia on DVD ==
Wikimedia Polska had a DVD using HTML dumps and a Java applet as a search
engine. It didn't sell, and there are no current plans for a new DVD.
== Wikimedia Foundation Conference Call ==
Erik Möller (Deputy Director of the Wikimedia Foundation) contacted me by mail
after the conference, stating that he had heard about openZIM and would like to
talk to me about it.
Quote:
"Since it seems like prior offline reader efforts are
merging into it, I would like to understand the goals and deliverables
of the project better, and also discuss whether we could support it in
any fashion."
We have now fixed a date with James Owen, his assistant, for a conference call
on Thursday evening, April 16th, at 20:00 our local time (CEST).
I would like to collect some statements and ideas on how the Wikimedia
Foundation could help us.
To do that in a collaborative way I created
http://openzim.org/Wikimedia_Foundation_Relationship
and will now start filling in what I have in mind, but I invite you to add what
you feel is important to mention from your point of view.
Thanks for your attention,
Manuel