Re: [openZIM dev-l] Kiwix and ZimReader/Writer Index Format

24 Aug 2009

Hi,

Tommi Mäkitalo wrote:
...
  When defining zim, I followed the original zeno format.

It seems reasonable to do that as a first shot.

...
  The directory entry in a zim (or zeno) has a field,
  which is named "title".

Yes, I had a look at that earlier.

...
  It can contain arbitrary UTF-8 characters. The directory
  entries are actually sorted by that field just to be able
  to quickly find the article with a specific title. As you
  say, it follows the mediawiki logic.

Hard-coding the mediawiki logic is certainly a good start
toward the primary goal of making the world's mediawikis,
e.g. the wikipedias, available for offline use.
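
To illustrate why sorting the directory by title makes lookup
quick, here is a minimal sketch of a binary search over sorted
entries. The entry layout, the sample titles and the name
find_by_title are assumptions made up for this example, and the
sort order is simply Python's default string comparison; none of
this is taken from the actual zim or zeno code.

```python
from bisect import bisect_left

# Hypothetical stand-in for the directory: (title, payload) pairs.
# A real directory entry has more fields; only the sort key matters here.
entries = sorted([
    ("Zeno", "article 3"),
    ("Kiwix", "article 1"),
    ("Openzim_File_Format", "article 2"),
])
titles = [title for title, _ in entries]


def find_by_title(title):
    """Return the payload for an exact title, or None if it is absent."""
    i = bisect_left(titles, title)  # binary search on the sorted titles
    if i < len(titles) and titles[i] == title:
        return entries[i][1]
    return None
```

The same lookup works unchanged if the sort key is a path such as
standards/fileformats/OpenzimFileFormat.html instead of a title.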

To support other sources though, their native addressing
currently has to be "mediawikinized". This might have
some implications as further potential sources are
considered.

Lots of web content is stored - and as such addressed -
in a hierarchical manner, that is in a subdirectory
structure. One prominent example is the Perl
documentation [1]. TWiki also allows for structured
addressing.

There's no guarantee that an article title appears only
once in such a collection. I have seen duplicates but
can't remember where :-(

For non-wiki content it's up to the author to supply
unique titles for his HTML content over the entire
collection. I strongly doubt this is always done,
even with machine-generated content.

So in such cases it would be an awful lot of work
to create a decent ZIM archive conforming to the
mediawiki logic.
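
As a rough illustration of the problem, the sketch below groups
the pages of a hierarchical collection by their HTML title and
reports the titles that occur more than once. The paths and
titles are invented for the example (they are not taken from the
Perl documentation), and colliding_titles is a hypothetical
helper name.

```python
from collections import defaultdict

# Hypothetical collection: path within the site -> HTML <title> text.
pages = {
    "functions/index.html": "Index",
    "modules/index.html": "Index",
    "perlfaq.html": "perlfaq",
}


def colliding_titles(pages):
    """Map each title used by more than one page to the sorted paths using it."""
    by_title = defaultdict(list)
    for path, title in pages.items():
        by_title[title].append(path)
    return {t: sorted(p) for t, p in by_title.items() if len(p) > 1}
```

The paths are unique by construction, while the titles collide
as soon as two subdirectories each contain a page called "Index".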

...
  I don't see any problem in that. If I sort by another
  string, which has no direct association with the content,
  I won't win anything.

Yes! The benefit is that a user can access any article
the way he is used to from the online content.

Online content at

  <onlineroot>/Openzim_File_Format
  <onlineroot>/OpenzimFileFormat
  <onlineroot>/standards/fileformats/OpenzimFileFormat.html

can be found as off-line content at

  <zimreaderroot>/A/Openzim_File_Format
  <zimreaderroot>/A/OpenzimFileFormat
  <zimreaderroot>/A/standards/fileformats/OpenzimFileFormat.html

Any link to the online source can easily be turned
into a link to the offline content by replacing
"<onlineroot>" with "<zimreaderroot>/A" in the above
example.
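
The replacement described above could be sketched as follows.
The function name to_offline and the concrete root URLs in the
usage line are assumptions for illustration only; the "/A"
prefix is the one used in the example above.

```python
def to_offline(url, online_root, reader_root):
    """Map an online URL to its offline counterpart under "/A"."""
    if not url.startswith(online_root + "/"):
        return url  # link points outside the archived site; leave it alone
    # Keep the path below the online root and prepend the reader root.
    return reader_root + "/A" + url[len(online_root):]


# Usage with made-up roots:
print(to_offline(
    "http://example.org/wiki/standards/fileformats/OpenzimFileFormat.html",
    "http://example.org/wiki",
    "http://localhost:8080",
))
```

Relative links inside the content need no rewriting at all, since
the path below the root is preserved.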

You never need to worry about duplicate or missing
titles when creating an archive.

Most of the online content can be archived as is
when it already uses relative links.

...
  I just need another index, which helps me find
  articles with a specific title.

Isn't that what the full text index is for? How else,
but by using the search function or the above addressing,
can a user navigate to a specific title / article at the
moment? So you shouldn't need an extra index :-)

The full text index might have to be optimized a bit for
that (just guessing):
- entries for multiple word titles might have to be added
- if a word is a title, articles with that title have to
  be at the top of the result list.
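
A minimal sketch of the second point, assuming the full text
search returns (title, score) pairs: hits whose title equals the
query are moved to the top, and the rest stay ordered by score.
The function name rank, the sample hits and the case-insensitive
comparison are assumptions for illustration, not part of any
existing index.

```python
def rank(query, hits):
    """Order (title, score) hits: exact title matches first, then by score."""
    q = query.casefold()
    # False sorts before True, so exact title matches come first;
    # within each group, higher scores come first.
    return sorted(hits, key=lambda h: (h[0].casefold() != q, -h[1]))


# Hypothetical search results for the query "zeno":
hits = [("History of Zeno", 0.9), ("Zeno", 0.4), ("Zeno format", 0.7)]
ranked = rank("zeno", hits)
```

Here "Zeno" ends up first although its score is the lowest,
which is the behaviour a user looking for a specific title would
expect.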
...
  Also I don't want to put the title into the article
  itself. This requires the reader to put some logic into
  the content and parse it. Currently the reader (my
  zimreader) just reads the content and passes it to the
  browser. It does not need to know anything about the
  actual content.

That's what I assumed and I support that. I think the
ZIM format should supply a basic level of abstraction
by offering necessary header fields, living with the
redundancy caused by that.

Cheers,
Marc

[1] http://perldoc.perl.org/perldoc-html.tar.gz
