Re: [openZIM dev-l] Kiwix and ZimReader/Writer Index Format

23 Aug 2009


Hi Emmanuel,
sorry for replying so late.
emmanuel@engelhart.org wrote:
...
...
Question 1:
...
From the change log I see that kiwix is using
a prominent search engine (Xapian) instead of the
mechanism ZimReader/Writer are using. Is there an
easy way to reuse an index produced by Kiwix on
a different machine?
Yes, although this is not trivial. The Xapian database is in your ~/.www.kiwix.org directory (in the [md5sum].index directory).
You can copy it to any other profile/account/computer, and Kiwix will then be able to search through a ZIM without running the indexing process. To reduce the size of the directory, you can also use "xapian-compact".
Good to know :-)
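Copying the index directory described above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names copy_index and compact_command are hypothetical, the index directory name is passed in by the caller rather than computed, and the xapian-compact command line is only built here, not executed.

```python
import shutil
from pathlib import Path
from typing import List


def copy_index(src_profile: Path, dst_profile: Path, index_name: str) -> Path:
    """Copy one '<md5sum>.index' directory from one Kiwix profile to another.

    index_name is supplied by the caller (assumption: the name is already known).
    """
    src = src_profile / index_name
    dst = dst_profile / index_name
    if not src.is_dir():
        raise FileNotFoundError(f"no index directory at {src}")
    dst_profile.mkdir(parents=True, exist_ok=True)
    # copytree raises if dst already exists, so an existing index is never overwritten
    shutil.copytree(src, dst)
    return dst


def compact_command(src_index: Path, dst_index: Path) -> List[str]:
    """Build the argv for 'xapian-compact SOURCE DESTINATION' without running it."""
    return ["xapian-compact", str(src_index), str(dst_index)]
```

On a machine with the Xapian tools installed, the list returned by compact_command can be handed to subprocess.run(..., check=True) to write a compacted copy of the index.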
...
We know there is a list of improvements to make to the usability of the current index management.
...
Question 2:
Are there plans to enable Kiwix to read reusable
indexes of the format released for ZimReader/
Writer?
I have nothing against making Kiwix compatible with different search engine backends... but this is not a priority for me yet.
I think I will do it in a middle far future, as soon as I have time for that or if for any reason a user really needs that.
That's fair. I was just curious with respect to porting
to arm or similar less capable architectures.
...
...
Question 3:
Are there plans to enable Kiwix to produce such a
reusable index?
I'm not sure I understand the question. Are you talking about the ZIM indexes?
In this case, cf. Question2 comment.
When integrating index search into kiwix you want to
integrate index generation (indexer) as well. Did I get
you right?
Since zimlib already does the index search and is an
essential part of kiwix it might be a good compromise
to first enable kiwix for searching a zim index, then in
the "middle far future" integrate zim index generation
as well. Just a thought!
...
...
Question 4:
Wouldn't it be desirable to deliver reusable indexes
together with zim-article-databases for all those
people with less capable devices (mids, netbooks,
phones) on the Kiwix site?
I do not believe having only one type of search engine is good at all: usages are multiple, and for this reason we have different search engines. I think the ZIM format should not force the user to make a choice.
Agreed! Open source is about choice - at least sometimes ;-).
...
I also think we have to be able to distribute content without shipping the data twice (a second time as indexes).
Agreed, as long as the user has the means to produce the index.
...
...
Question 5:
The zim databases supplied on the Kiwix site [1]
seem to use the articles title field as article id field,
which - I'm sure - solves some problems for Kiwix,
but results in a list of article ids as result of a search
on zimreader instead of a list of article titles. Since
both Kiwix and ZimReader are part of the openzim
standardization effort, this confuses me a bit. Which
format is supposed to be the standard?
The question aimed at the state and stability of standard
ZIM archive format. When a field "title" is read via API
method getTitle(), but delivers a surrogate id, this is
confusing for archive creators and developers alike. Are
you arguing in favor of changing the format?
That's what my question about "New Header Fields" in the
other email [1] was about.
...
Tommi already answered that...
IMO this is the job of the indexer to find the title... if there is an HTML page with a title, it has to use it.
It could even be done by zimreader before displaying the
search results. Then titles don't have to be stored in the index.
I can see your point of avoiding redundant titles in the ZIM
archive. But isn't the introduction of new machine generated
data into the archive just as bad? What are the end user
benefits of a surrogate id? On the contrary users can't
directly address articles by URL the way they do on their
favorite online content.
While the title can be easily derived from items of mime type
"text/html", it might be useful to supply title information for
other mime types not capable of storing a title ("text/plain"
or even image/xxx).  Such information could be used to
present any kind of lists and enable searching. When looking
at current wikipedia sources though, I'm wondering how this
information could reliably be gathered: Often description
fields are inconsistently supplied. How would I go about
descriptions in multiple languages? So trying to supply titles
in the header field doesn't seem to be that easy for most
mime types. For HTML it can be a valid abstraction
(application doesn't need to be able to read specific contents)
and optimization measure.
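To make the fallback idea above concrete, here is a minimal sketch of how a reader or indexer might pick a title. It is not how zimlib works; display_title and its parameters are hypothetical names, and the order of preference (an explicit header title first, then the HTML <title>, then the address) is an assumption made for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def display_title(mime_type: str, content: str, address: str,
                  header_title: Optional[str] = None) -> str:
    """Pick a title: header field, then HTML <title>, then the address (assumed order)."""
    if header_title:
        return header_title
    if mime_type == "text/html":
        parser = _TitleParser()
        parser.feed(content)
        # Collapse runs of whitespace and newlines inside the title text
        title = " ".join("".join(parser.parts).split())
        if title:
            return title
    return address
```

With this arrangement, "text/plain" or image items simply show their address unless the archive creator supplied a title, so no title has to be stored for HTML items at all.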
...
More globally, IMO forcing url=title on ZIM creators is a bad idea,
Basically that's what MediaWiki does. Why should it be a
bad idea for its static off-line counterpart?
But then: Does openzim only want to support MediaWiki?
I would hope not. I'd like to see at least basic support
for other wiki contents. They might use non-title URLs
or at least do different title-to-url conversions (e.g. twiki).
I assume that's what you're concerned about.
...
we should never be forced (by the format) to adopt a particular way of representing/storing information.
I agree about the representing, not though about the storing:
Isn't that what file format standards are about: forcing the way
information is stored and ensuring semantics?
It took me quite some time to get halfway decided about the
above and I wouldn't rule out being proven wrong. At the risk
of getting flamed, I'd like to summarize what I learned from
the discussion:
1. The header field "title" and the related API method
   getTitle() should be renamed to something like "name",
   "id" or "address" to reflect the fact that the field
   is exclusively used for addressing the item. The field
   should contain whatever the source (wiki) uses for
   addressing (title as URL, filename, wikiname, ...),
   to allow "native" addressing for users.
2. For rendering zimreader/zimlib/zimwriter should retrieve
   the title from the HTML content or an additional header
   field [2] to support arbitrary sources (wikis).
3. Kiwix archives should conform to 1.
Further header fields might be added later including a title
field as necessary and as you already proposed in [2].
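The summary above could be pictured as a small data structure. This is purely an illustration of the proposal, not the ZIM format or the zimlib API: Entry, Archive, address, label and search are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Entry:
    address: str                 # whatever the source wiki uses for addressing
    mime_type: str
    title: Optional[str] = None  # optional, separate header field for display

    def label(self) -> str:
        """Text to show in lists: the title if supplied, otherwise the address."""
        return self.title or self.address


class Archive:
    """In-memory stand-in for an archive, keyed by address only."""

    def __init__(self, entries: Iterable[Entry]) -> None:
        self._by_address = {e.address: e for e in entries}

    def get(self, address: str) -> Entry:
        return self._by_address[address]

    def search(self, needle: str) -> List[str]:
        """Return sorted display labels that contain needle, ignoring case."""
        wanted = needle.lower()
        return sorted(e.label() for e in self._by_address.values()
                      if wanted in e.label().lower())
```

The point of the split is that lookups always go through the address, while search results and lists show a human-readable label, which is the behaviour Question 5 found missing.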
I'd really love to see openzim succeed as a stable standard
providing a rich choice of contents.
Cheers,
Marc
[1] https://intern.openzim.org/pipermail/dev-l/2009-August/000147.html
[2] http://bugs.openzim.org/show_bug.cgi?id=4
