Hi Marc,

On Tue 11/08/09 16:26, "Marc Bantle" openmoko@rcie.de wrote:
I observed that Kiwix is producing an "ad-hoc" type index. This may be useful for desktops, as they have the power to generate an index file on the fly. On small-footprint devices this will not reasonably be possible, due to a lack of memory and CPU resources.
Yes
Even on a dual-core desktop with 3.5 GB of memory, Kiwix failed to produce an "ad-hoc" index of the openzim edition of the German Wikipedia, running out of memory after many hours.
Yes, this is Kiwix's specific issue with really big (with a lot of text) ZIM files.
Question 1:
From the change log I see that Kiwix is using a prominent search engine (Xapian) instead of the mechanism ZimReader/Writer are using. Is there an easy way to reuse an index produced by Kiwix on a different machine?
Yes, although this is not trivial. The Xapian database is in your ~/.www.kiwix.org directory (in the [md5sum].index directory). You can copy it to any other profile, account or computer, and Kiwix will then be able to search through a ZIM file without running the indexing process. To reduce the size of the directory, you can also use "xapian-compact".
We know there is a list of improvements to make to the usability of the current index management.
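The copy step described above could be scripted roughly as follows. This is only a sketch: the function names and all paths are hypothetical, the index directory name is passed in rather than computed (the thread does not say how the md5sum is derived), and the only thing assumed about "xapian-compact" is its usual calling form with a source database followed by a destination database.

```python
import shutil
from pathlib import Path


def copy_index(index_dir: Path, target_profile: Path) -> Path:
    """Copy an existing index directory into another profile directory.

    The directory name (e.g. "<md5sum>.index") is kept unchanged, on the
    assumption that Kiwix looks the index up by that name.
    """
    if not index_dir.is_dir():
        raise FileNotFoundError(f"no index directory at {index_dir}")
    target_profile.mkdir(parents=True, exist_ok=True)
    destination = target_profile / index_dir.name
    shutil.copytree(index_dir, destination)
    return destination


def compact_command(source_db: Path, compacted_db: Path) -> list[str]:
    """Build the xapian-compact call: source database, then destination."""
    return ["xapian-compact", str(source_db), str(compacted_db)]
```

The command list returned by compact_command() can be handed to subprocess.run() on a machine where the Xapian tools are installed.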
Question 2: Are there plans to enable Kiwix to read reusable indexes of the format released for ZimReader/Writer?
I have nothing against making Kiwix compatible with different search engine backends, but this is not a priority for me yet. I think I will do it in the middle far future, as soon as I have time for it or if for any reason a user really needs it.
Question 3: Are there plans to enable Kiwix to produce such a reusable index?
I am not sure I understand the question. Are you talking about the ZIM indexes? If so, see my comment on Question 2.
Question 4: Wouldn't it be desirable to deliver reusable indexes together with zim-article-databases for all those people with less capable devices (mids, netbooks, phones) on the Kiwix site?
I do not believe that having only one type of search engine is good at all: usages are multiple, and for this reason we have different search engines. I think the ZIM format should not force the user to make a choice. I also think we have to be able to distribute content without shipping the data twice (with indexes). And finally, I do not think that having compatible indexes should be a priority, because 99% of the users don't care about that (they simply use only one client).
Question 5: The zim databases supplied on the Kiwix site [1] seem to use the article's title field as the article id field, which - I'm sure - solves some problems for Kiwix, but results in a list of article ids as the result of a search on zimreader instead of a list of article titles. Since both Kiwix and ZimReader are part of the openzim standardization effort, this confuses me a bit. Which format is supposed to be the standard?
Tommi already answered that...
IMO it is the job of the indexer to find the title... if there is an HTML page with a title, it has to use it.
More globally, IMO forcing ZIM creators to use url=title is a bad idea; we should never be forced (by the format) to adopt a particular way of representing/storing information. Everything that is possible with "normal" HTTP/HTML should also be possible with content in a ZIM file.
Emmanuel
Hi Emmanuel,
sorry for replying so late.
emmanuel@engelhart.org wrote:
Question 1:
From the change log I see that Kiwix is using a prominent search engine (Xapian) instead of the mechanism ZimReader/Writer are using. Is there an easy way to reuse an index produced by Kiwix on a different machine?
Yes, although this is not trivial. The Xapian database is in your ~/.www.kiwix.org directory (in the [md5sum].index directory). You can copy it to any other profile, account or computer, and Kiwix will then be able to search through a ZIM file without running the indexing process. To reduce the size of the directory, you can also use "xapian-compact".
Good to know :-)
We know there is a list of improvements to make to the usability of the current index management.
Question 2: Are there plans to enable Kiwix to read reusable indexes of the format released for ZimReader/Writer?
I have nothing against making Kiwix compatible with different search engine backends, but this is not a priority for me yet. I think I will do it in the middle far future, as soon as I have time for it or if for any reason a user really needs it.
That's fair. I was just curious with respect to porting to ARM or similar less capable architectures.
Question 3: Are there plans to enable Kiwix to produce such a reusable index?
I am not sure I understand the question. Are you talking about the ZIM indexes? If so, see my comment on Question 2.
When integrating index search into kiwix you want to integrate index generation (indexer) as well. Did I get you right?
Since zimlib already does the index search and is an essential part of kiwix it might be a good compromise to first enable kiwix for searching a zim index, then in the "middle far future" integrate zim index generation as well. Just a thought!
Question 4: Wouldn't it be desirable to deliver reusable indexes together with zim-article-databases for all those people with less capable devices (mids, netbooks, phones) on the Kiwix site?
I do not believe that having only one type of search engine is good at all: usages are multiple, and for this reason we have different search engines. I think the ZIM format should not force the user to make a choice.
Agreed! Open source is about choice - at least sometimes ;-).
I also think we have to be able to distribute content without shipping the data twice (with indexes).
Agreed, as long as the user has the means to produce the index.
Question 5: The zim databases supplied on the Kiwix site [1] seem to use the article's title field as the article id field, which - I'm sure - solves some problems for Kiwix, but results in a list of article ids as the result of a search on zimreader instead of a list of article titles. Since both Kiwix and ZimReader are part of the openzim standardization effort, this confuses me a bit. Which format is supposed to be the standard?
The question was aimed at the state and stability of the standard ZIM archive format. When a field "title" is read via the API method getTitle() but delivers a surrogate id, this is confusing for archive creators and developers alike. Are you arguing in favor of changing the format?
That's what my question about "New Header Fields" in the other email [1] was about.
Tommi already answered that...
IMO it is the job of the indexer to find the title... if there is an HTML page with a title, it has to use it.
It could even be done by zimreader before displaying the search results. Then titles don't have to be stored in the index.
I can see your point about avoiding redundant titles in the ZIM archive. But isn't the introduction of new machine-generated data into the archive just as bad? What are the end user benefits of a surrogate id? On the contrary, users can't directly address articles by URL the way they do with their favorite online content.
While the title can easily be derived from items of mime type "text/html", it might be useful to supply title information for other mime types not capable of storing a title ("text/plain" or even image/xxx). Such information could be used to present any kind of list and to enable searching. When looking at current Wikipedia sources, though, I'm wondering how this information could reliably be gathered: often description fields are inconsistently supplied. How would I handle descriptions in multiple languages? So trying to supply titles in the header field doesn't seem to be that easy for most mime types. For HTML it can be a valid abstraction (the application doesn't need to be able to read specific contents) and an optimization measure.
More globally, IMO forcing ZIM creators to use url=title is a bad idea,
Basically that's what MediaWiki does. Why should it be a bad idea for its static off-line counterpart?
But then: does openzim only want to support MediaWiki? I would hope not. I'd like to see at least basic support for other wiki contents. They might use non-title URLs or at least do different title-to-URL conversions (e.g. twiki). I assume that's what you're concerned about.
we should never be forced (by the format) to adopt a particular way of representing/storing information.
I agree about the representation, though not about the storage. Isn't that what file format standards are about: enforcing the way information is stored and ensuring semantics?
It took me quite some time to get halfway decided about the above, and I wouldn't rule out being proven wrong. At the risk of getting flamed, I'd like to summarize what I learned from the discussion:
1. The header field "title" and the related API method getTitle() should be renamed to something like "name", "id" or "address" to reflect the fact that the field is exclusively used for addressing the item. The field should contain whatever the source (wiki) uses for addressing (title as URL, filename, wikiname, ...), to allow "native" addressing for users.
2. For rendering, zimreader/zimlib/zimwriter should retrieve the title from the HTML content or an additional header field [2] to support arbitrary sources (wikis).
3. Kiwix archives should conform to 1.
Further header fields might be added later including a title field as necessary and as you already proposed in [2].
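Point 2 could look roughly like the sketch below. It is not zimlib code: display_title(), its parameters and the fallback order (HTML title first, then an optional header title, then the address itself) are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element of an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace; an empty title counts as missing.
        text = " ".join("".join(self._parts).split())
        return text or None


def display_title(address: str, mime_type: str, content: str,
                  header_title: Optional[str] = None) -> str:
    """Pick a title for rendering; the address is only the last resort."""
    if mime_type == "text/html":
        parser = TitleParser()
        parser.feed(content)
        if parser.title:
            return parser.title
    return header_title or address
```

With this split, the address field stays free to hold whatever the source uses for addressing, and only HTML items need any parsing.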
I'd really love to see openzim succeed as a stable standard providing a rich choice of contents.
Cheers, Marc
[1] https://intern.openzim.org/pipermail/dev-l/2009-August/000147.html
[2] http://bugs.openzim.org/show_bug.cgi?id=4
Hi,
there is actually a disagreement between Emmanuel and me about the role of that title.
When defining zim, I followed the original zeno format.
The directory entry in a zim (or zeno) file has a field named "title". It can contain arbitrary UTF-8 characters. The directory entries are actually sorted by that field, just to be able to quickly find the article with a specific title. As you say, it follows the MediaWiki logic.
I don't see any problem in that. If I sort by another string, which has no direct association with the content, I won't gain anything. I would just need another index to help me find articles with a specific title.
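The lookup that this sort order makes possible is an ordinary binary search. The sketch below is not the zimlib implementation: it works on a plain in-memory list of titles and assumes Python's default string ordering, which may differ from the ordering actually used in a ZIM file.

```python
import bisect
from typing import Optional, Sequence


def find_by_title(sorted_titles: Sequence[str], title: str) -> Optional[int]:
    """Return the index of the entry with this title, or None if absent.

    sorted_titles must already be sorted; the search then takes
    O(log n) comparisons instead of a scan over all entries.
    """
    i = bisect.bisect_left(sorted_titles, title)
    if i < len(sorted_titles) and sorted_titles[i] == title:
        return i
    return None


# Hypothetical directory with three entries.
titles = sorted(["Zürich", "Berlin", "Openzim_File_Format"])
position = find_by_title(titles, "Openzim_File_Format")
```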
Also I don't want to put the title into the article itself. This requires the reader to put some logic into the content and parse it. Currently the reader (my zimreader) just reads the content and passes it to the browser. It does not need to know anything about the actual content.
Tommi
Hi,
Tommi Mäkitalo wrote:
When defining zim, I followed the original zeno format.
It seems reasonable to do that as a first shot.
The directory entry in a zim (or zeno) file has a field named "title".
Yes, I had a look at that earlier.
It can contain arbitrary UTF-8 characters. The directory entries are actually sorted by that field, just to be able to quickly find the article with a specific title. As you say, it follows the MediaWiki logic.
Hard-coding the MediaWiki logic is certainly a good start toward the primary goal of making the world's MediaWikis, e.g. the Wikipedias, available for offline use.
To support other sources though, their native addressing currently has to be "mediawikinized". This might have some implications as further potential sources are considered.
Lots of web content is stored - and as such addressed - in a hierarchical manner, that is in a subdirectory structure. One prominent example is the perl documentation [1]. Twiki also allows for structured addressing.
There's no guarantee that the same article title won't appear more than once in different articles of such a collection. I have seen such cases but can't remember where :-(
For non-wiki content it's up to the author to supply unique titles for his HTML content over the entire collection. I strongly doubt this is always done, even with machine-generated content.
So in such cases it would be an awful lot of work to create a decent ZIM archive conforming to the MediaWiki logic.
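To get a feeling for how much work that would be for a given collection, one could first look for title collisions. The sketch below assumes the collection is already available as (path, title) pairs; the function name and the sample pages are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable


def duplicate_titles(pages: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Map every title used by more than one page to the paths using it."""
    paths_by_title: dict[str, list[str]] = defaultdict(list)
    for path, title in pages:
        paths_by_title[title].append(path)
    return {t: p for t, p in paths_by_title.items() if len(p) > 1}


# Hypothetical hierarchical collection: the paths are unique, the titles are not.
pages = [
    ("perl/intro.html", "Introduction"),
    ("python/intro.html", "Introduction"),
    ("perl/faq.html", "FAQ"),
]
collisions = duplicate_titles(pages)
```

Every title that shows up in the result would have to be made unique by hand before the collection could be keyed by title, whereas the paths can be used as they are.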
I don't see any problem in that. If I sort by another string, which has no direct association with the content, I won't gain anything.
Yes! The benefit is that a user can access any article the way he is used to from his online content.
Online content at

<onlineroot>/Openzim_File_Format
<onlineroot>/OpenzimFileFormat
<onlineroot>/standards/fileformats/OpenzimFileFormat.html

can be found as off-line content at

<zimreaderroot>/A/Openzim_File_Format
<zimreaderroot>/A/OpenzimFileFormat
<zimreaderroot>/A/standards/fileformats/OpenzimFileFormat.html
Any link to the online source can easily be turned into a link to the offline content by replacing "<onlineroot>" by "<zimreaderroot>/A" in the above example.
You never need to worry about duplicate or missing titles when creating an archive.
Most of the online content can be archived as is when it already uses relative links.
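The link conversion described above is a plain prefix replacement. A minimal sketch, in which the function name and the example host names are hypothetical and "A" is the namespace from the example:

```python
def to_offline_url(url: str, online_root: str, zimreader_root: str,
                   namespace: str = "A") -> str:
    """Turn a link to the online source into a link to the offline copy."""
    online_root = online_root.rstrip("/")
    if url != online_root and not url.startswith(online_root + "/"):
        raise ValueError(f"{url} is not below {online_root}")
    # Keep everything after the online root, including subdirectories.
    path = url[len(online_root):]
    return f"{zimreader_root.rstrip('/')}/{namespace}{path}"


offline = to_offline_url(
    "http://example.org/wiki/standards/fileformats/OpenzimFileFormat.html",
    online_root="http://example.org/wiki",
    zimreader_root="http://localhost:8080",
)
```

Because only the prefix changes, hierarchical paths survive unchanged and no title lookup is involved.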
I would just need another index to help me find articles with a specific title.
Isn't that what the full text index is for? How else but by using the search function or the above addressing can a user navigate to a specific title/article at the moment? So you shouldn't need an extra index :-)
The full text index might have to be optimized a bit for that (just guessing): entries for multi-word titles might have to be added, and if a word is a title, articles with that title have to be at the top of the result list.
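The "title matches first" rule could be applied as a re-ranking step on top of whatever the full text search returns. This is only a guess at one possible approach: the (title, score) result shape and the case-insensitive comparison are assumptions, not something zimlib or Xapian prescribes.

```python
def rank_results(query: str, hits: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Put hits whose title equals the query first, then sort by score.

    hits is a list of (title, score) pairs; a higher score is better.
    """
    wanted = query.casefold()
    # False sorts before True, so exact title matches come first.
    return sorted(hits, key=lambda hit: (hit[0].casefold() != wanted, -hit[1]))


ranked = rank_results("zim", [("Zimbabwe", 0.9), ("ZIM", 0.4), ("Zim file", 0.7)])
```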
Also I don't want to put the title into the article itself. This requires the reader to put some logic into the content and parse it. Currently the reader (my zimreader) just reads the content and passes it to the browser. It does not need to know anything about the actual content.
That's what I assumed and I support that. I think the ZIM format should supply a basic level of abstraction by offering necessary header fields, living with the redundancy caused by that.
Cheers, Marc