Hi Emmanuel,
sorry for replying so late.
emmanuel@engelhart.org wrote:
Question 1:
From the change log I see that Kiwix is using a prominent search engine (Xapian) instead of the mechanism ZimReader/Writer are using. Is there an easy way to reuse an index produced by Kiwix on a different machine?
Yes, although this is not trivial. The Xapian database is in your ~/.www.kiwix.org directory (in the [md5sum].index directory). You can copy it to any other profile/account/computer, and Kiwix will then be able to search through a ZIM file without running the indexing process. To reduce the size of the directory, you can also use "xapian-compact".
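For illustration, the copy step could be scripted roughly as below. This is only a sketch: the function names copy_index and compact_command are made up here, the paths are passed in by the caller because the exact location of the [md5sum].index directory inside the profile is not spelled out above, and the compaction helper only builds the "xapian-compact SOURCE DESTINATION" command line without running it.

```python
import shutil
from pathlib import Path
from typing import List


def copy_index(src_index: Path, dst_profile: Path) -> Path:
    """Copy one index directory into another profile directory.

    Hypothetical helper: both paths are supplied by the caller.
    """
    if not src_index.is_dir():
        raise FileNotFoundError(f"not an index directory: {src_index}")
    dst_profile.mkdir(parents=True, exist_ok=True)
    target = dst_profile / src_index.name
    # copytree raises FileExistsError if the target is already present,
    # so an existing index is never silently overwritten.
    shutil.copytree(src_index, target)
    return target


def compact_command(src_index: Path, compacted: Path) -> List[str]:
    """Build (but do not run) a xapian-compact call: source, then destination."""
    return ["xapian-compact", str(src_index), str(compacted)]
```

The list returned by compact_command can be handed to subprocess.run on a machine where the Xapian tools are installed.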
Good to know :-)
We know there is a list of improvements to make to the usability of the current index management.
Question 2: Are there plans to enable Kiwix to read reusable indexes of the format released for ZimReader/Writer?
I have nothing against making Kiwix compatible with different search engine backends, but this is not a priority for me yet. I think I will do it in a middle far future, as soon as I have time for it or if for any reason a user really needs it.
That's fair. I was just curious with respect to porting to arm or similar less capable architectures.
Question 3: Are there plans to enable Kiwix to produce such a reusable index?
I'm not sure I understand the question. Are you talking about the ZIM indexes? In that case, cf. my comment on Question 2.
When integrating index search into kiwix you want to integrate index generation (indexer) as well. Did I get you right?
Since zimlib already does the index search and is an essential part of kiwix it might be a good compromise to first enable kiwix for searching a zim index, then in the "middle far future" integrate zim index generation as well. Just a thought!
Question 4: Wouldn't it be desirable to deliver reusable indexes together with zim-article-databases for all those people with less capable devices (mids, netbooks, phones) on the Kiwix site?
I do not believe having only one type of search engine is good at all: usages are multiple, and for this reason we have different search engines. I think the ZIM format should not force the user to make a choice.
Agreed! Open source is about choice - at least sometimes ;-).
I also think we have to be able to distribute content without shipping the data twice (once more inside the indexes).
Agreed, as long as the user has the means to produce the index.
Question 5: The ZIM databases supplied on the Kiwix site [1] seem to use the article's title field as the article id field, which - I'm sure - solves some problems for Kiwix, but results in a list of article ids as the result of a search in zimreader instead of a list of article titles. Since both Kiwix and ZimReader are part of the openzim standardization effort, this confuses me a bit. Which format is supposed to be the standard?
The question was aimed at the state and stability of the standard ZIM archive format. When a field "title" is read via the API method getTitle() but delivers a surrogate id, this is confusing for archive creators and developers alike. Are you arguing in favor of changing the format?
That's what my question about "New Header Fields" in the other email [1] was about.
Tommi already answered to that...
IMO this is the job of the indexer to find the title... if there is a HTML page with a title, it has to use it.
It could even be done by zimreader before displaying the search results. Then titles don't have to be stored in the index.
I can see your point of avoiding redundant titles in the ZIM archive. But isn't the introduction of new machine-generated data into the archive just as bad? What are the end user benefits of a surrogate id? On the contrary, users can't directly address articles by URL the way they do on their favorite online content.
While the title can easily be derived from items of mime type "text/html", it might be useful to supply title information for other mime types that cannot store a title themselves ("text/plain" or even image/xxx). Such information could be used to present any kind of list and to enable searching. When looking at current Wikipedia sources, though, I'm wondering how this information could be gathered reliably: description fields are often supplied inconsistently. And how would I handle descriptions in multiple languages? So supplying titles in the header field doesn't seem to be that easy for most mime types. For HTML it can be a valid abstraction (the application doesn't need to be able to read specific contents) and an optimization measure.
More generally, IMO forcing url=title on ZIM creators is a bad idea,
Basically that's what MediaWiki does. Why should it be a bad idea for its static offline counterpart?
But then: Does openzim only want to support MediaWiki? I would hope not. I'd like to see at least basic support for other wiki contents. They might use non-title URLs or at least do different title-to-url conversions (e.g. twiki). I assume that's what you're concerned about.
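To make the point about differing conversions concrete, here is a deliberately simplified sketch. Both functions are illustrative assumptions, not anything openzim, MediaWiki or twiki actually ship: the first imitates the common MediaWiki default (spaces become underscores, first letter upper-cased, the rest percent-encoded), the second imitates a WikiWord-style name where the words are capitalized and run together.

```python
from urllib.parse import quote


def mediawiki_style_name(title: str) -> str:
    """Simplified MediaWiki-like conversion (assumed defaults only)."""
    name = title.strip().replace(" ", "_")
    if name:
        name = name[0].upper() + name[1:]
    # Percent-encode everything that is not URL-safe, as UTF-8.
    return quote(name)


def wiki_word_style_name(title: str) -> str:
    """Simplified WikiWord-like conversion: capitalize words and join them."""
    return "".join(word[0].upper() + word[1:] for word in title.split())


# The same human-readable title yields two different addresses.
print(mediawiki_style_name("web home"))
print(wiki_word_style_name("web home"))
```

An archive format that hard-codes only one of these rules cannot address items from the other kind of source natively.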
we should never force anyone (through the format) to adopt a particular way of representing/storing information.
I agree about the representing, though not about the storing. Isn't that what file format standards are about: forcing the way information is stored and ensuring its semantics?
It took me quite some time to get halfway decided about the above, and I wouldn't rule out being proven wrong. At the risk of getting flamed, I'd like to summarize what I learned from the discussion:
1. The header field "title" and the related API method getTitle() should be renamed to something like "name", "id" or "address" to reflect the fact that the field is exclusively used for addressing the item. The field should contain whatever the source (wiki) uses for addressing (title as URL, filename, wikiname, ...), to allow "native" addressing for users.
2. For rendering, zimreader/zimlib/zimwriter should retrieve the title from the HTML content or an additional header field [2] to support arbitrary sources (wikis).
3. Kiwix archives should conform to 1.
Further header fields might be added later including a title field as necessary and as you already proposed in [2].
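As a rough illustration of point 2, a reader could pick the display title along these lines. This is a sketch under assumptions, not zimlib code: display_title, its parameters and the fallback order (HTML title first, then an optional header title, then the addressing name) are all invented for this example.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse whitespace; an empty title counts as missing.
        text = " ".join("".join(self._parts).split())
        return text or None


def display_title(name: str, mime_type: str, content: str,
                  header_title: Optional[str] = None) -> str:
    """Hypothetical fallback chain: HTML <title>, header title, then name."""
    if mime_type == "text/html":
        parser = _TitleParser()
        parser.feed(content)
        parser.close()
        if parser.title:
            return parser.title
    return header_title or name
```

With this, an HTML article shows its own title in result lists, while an image or plain-text item falls back to the optional header title or, failing that, to the name it is addressed by.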
I'd really love to see openzim succeed as a stable standard providing a rich choice of contents.
Cheers, Marc
[1] https://intern.openzim.org/pipermail/dev-l/2009-August/000147.html
[2] http://bugs.openzim.org/show_bug.cgi?id=4