Hi,
Tommi Mäkitalo wrote:
When defining zim, I followed the original zeno format.
It seems reasonable to do that as a first shot.
The directory entry in a zim (or zeno) has a field, which is named "title".
Yes, I had a look at that earlier.
It can contain arbitrary UTF-8 characters. The directory entries are actually sorted by that field so that the article with a specific title can be found quickly. As you say, it follows the mediawiki logic.
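As a rough illustration of why sorting by title makes lookup fast, here is a minimal sketch of a binary search over sorted titles. The entry tuples, the sample titles, and the `find_article` helper are hypothetical. The sketch uses Python's code point ordering, which may differ from the ordering the actual format uses.

```python
from bisect import bisect_left

# Hypothetical directory entries: (title, article index), kept sorted by title.
entries = sorted([
    ("Zürich", 2),
    ("Openzim_File_Format", 0),
    ("Perl", 1),
])
titles = [title for title, _ in entries]


def find_article(title):
    """Return the article index for an exact title, or None if it is absent."""
    # Binary search: O(log n) comparisons instead of scanning every entry.
    pos = bisect_left(titles, title)
    if pos < len(titles) and titles[pos] == title:
        return entries[pos][1]
    return None


print(find_article("Perl"))
print(find_article("No_Such_Title"))
```

Note that a lookup like this can only return a single entry per title, which is why duplicate titles become a problem.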
Hard-coding the mediawiki logic is certainly a good start toward the primary goal of making the world's mediawikis, e.g. the wikipedias, available for offline use.
To support other sources though, their native addressing currently has to be "mediawikinized". This might have some implications as further potential sources are considered.
Lots of web content is stored, and as such addressed, in a hierarchical manner, that is, in a subdirectory structure. One prominent example is the Perl documentation [1]. TWiki also allows for structured addressing.
There's no guarantee that the same title won't occur in more than one article of such a collection. I have seen such cases but can't remember where :-(
For non-wiki content it's up to the author to supply unique titles for his HTML content over the entire collection. I strongly doubt this is always done, even with machine-generated content.
So in such cases it would be an awful lot of work to create a decent ZIM archive conforming to the mediawiki logic.
I don't see any problem in that. If I sort by another string, which has no direct association with the content, I won't gain anything.
Yes! The benefit is that a user can access any article the way he is used to from the online content.
Online content at
  <onlineroot>/Openzim_File_Format
  <onlineroot>/OpenzimFileFormat
  <onlineroot>/standards/fileformats/OpenzimFileFormat.html
can be found as off-line content at
  <zimreaderroot>/A/Openzim_File_Format
  <zimreaderroot>/A/OpenzimFileFormat
  <zimreaderroot>/A/standards/fileformats/OpenzimFileFormat.html
Any link to the online source can easily be turned into a link to the offline content by replacing "<onlineroot>" with "<zimreaderroot>/A" in the above example.
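The prefix replacement could be sketched roughly as follows. The function name `to_offline_link` and both root URLs are made-up placeholders standing in for <onlineroot> and <zimreaderroot>. This is only an illustration of the idea, not how any existing reader does it.

```python
def to_offline_link(url, online_root, reader_root, namespace="A"):
    """Map an online URL to its offline counterpart by swapping the root prefix."""
    # Compare against the root plus a trailing slash so that a sibling
    # directory such as ".../docs2" is not mistaken for ".../docs".
    root = online_root.rstrip("/") + "/"
    if not url.startswith(root):
        raise ValueError(f"{url!r} is not below {online_root!r}")
    # The path below the root is kept untouched, subdirectories included.
    path = url[len(root):]
    return f"{reader_root.rstrip('/')}/{namespace}/{path}"


ONLINE_ROOT = "http://example.org/docs"   # hypothetical <onlineroot>
READER_ROOT = "http://localhost:8080"     # hypothetical <zimreaderroot>

for page in ("Openzim_File_Format",
             "standards/fileformats/OpenzimFileFormat.html"):
    print(to_offline_link(f"{ONLINE_ROOT}/{page}", ONLINE_ROOT, READER_ROOT))
```

Because the path is carried over unchanged, two pages with the same title in different subdirectories still end up at different offline addresses.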
You never need to worry about duplicate or missing titles when creating an archive.
Most of the online content can be archived as is when it already uses relative links.
I just need another index, which helps me find articles with a specific title.
Isn't that what the full text index is for? How else but by using the search function or the above addressing can a user navigate to a specific title / article at the moment? So you shouldn't need an extra index :-)
The full text index might have to be optimized a bit for that (just guessing): entries for multi-word titles might have to be added, and if a search term is a title, articles with that title should be at the top of the result list.
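To make the ranking idea concrete, here is a minimal sketch that moves exact title matches to the top of a result list. The `(title, score)` hit format, the sample scores, and the `rank_results` helper are assumptions for illustration only. A real full text index would supply its own result structure.

```python
def rank_results(query, hits):
    """Sort hits so exact title matches come first, then by descending score."""
    wanted = query.casefold()
    # False sorts before True, so hits whose title equals the query lead.
    return sorted(hits, key=lambda hit: (hit[0].casefold() != wanted, -hit[1]))


# Hypothetical full text hits as (title, relevance score) pairs.
hits = [
    ("File format", 0.9),
    ("Openzim File Format", 0.4),
    ("Zeno", 0.7),
]

for title, score in rank_results("openzim file format", hits):
    print(title, score)
```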
Also I don't want to put the title into the article itself. That would require the reader to apply some logic to the content and parse it. Currently the reader (my zimreader) just reads the content and passes it to the browser. It does not need to know anything about the actual content.
That's what I assumed and I support that. I think the ZIM format should supply a basic level of abstraction by offering the necessary header fields, and live with the redundancy this causes.
Cheers, Marc