Dear ZIM hackers,
I recently improved kiwix-serve, the Kiwix HTTP software. As a small reminder: this software is an HTTP server able to deliver ZIM file contents, so it acts as a Web server. kiwix-serve can now deal with many ZIM files at the same time, with only one binary instance. That means a single Web server can host contents belonging to many different ZIM files. There is a demo here: http://library.kiwix.org
Both the ZIM files we make at Kiwix and the ZIM files generated from Wikipedia contain article HTML with absolute internal URLs. That means that, in the HTML of an article, a link pointing to the article "Wikipedia" (this is an example) has a URL like "/A/Wikipedia" (or "/A/Wikipedia.html" in my case, but this does not matter).
Until now, this was not a problem because ZIM files were always used one at a time, so the context was clear. But now, in my case, I need to specify which ZIM file I want to deal with. If I want to open the "Wikipedia" article in WPEN, I need something like this: http://library.kiwix.org/wikipedia_en_all_nopic/A/Wikipedia.html
Here is the problem: I have HTML code with URLs looking like "/A/Wikipedia.html" and I need something like "/wikipedia_en_all_nopic/A/Wikipedia.html". I have found a workaround by rewriting the URLs on the fly, but this is an ugly solution which is absolutely not sustainable.
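To make the workaround concrete, here is a minimal Python sketch of on-the-fly rewriting. It is not the actual kiwix-serve code: the function name prefix_absolute_urls and the regular expression are assumptions made for illustration, and only href and src attributes are handled.

```python
import re

# Matches href="/..." or src="/..." but not protocol-relative URLs ("//host/...").
_ABSOLUTE_ATTR = re.compile(r'''(\b(?:href|src)\s*=\s*)(["'])/(?!/)''', re.IGNORECASE)


def prefix_absolute_urls(html: str, book_name: str) -> str:
    """Prepend /<book_name> to every root-absolute href/src value in html.

    book_name is the path segment chosen by the server operator
    (hypothetical helper, for illustration only).
    """
    return _ABSOLUTE_ATTR.sub(
        lambda m: f"{m.group(1)}{m.group(2)}/{book_name}/", html
    )


page = '<a href="/A/Wikipedia.html">Wikipedia</a> <img src="/I/logo.png">'
rewritten = prefix_absolute_urls(page, "wikipedia_en_all_nopic")
print(rewritten)
# <a href="/wikipedia_en_all_nopic/A/Wikipedia.html">Wikipedia</a> <img src="/wikipedia_en_all_nopic/I/logo.png">
```

Every page has to go through such a pass before it is sent, and URLs that appear elsewhere (CSS, scripts) are missed, which is why this does not scale well.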
As far as I know, we do not have any specification relating to this. In my opinion, absolute internal URLs should be forbidden. To continue with my example: in "/wikipedia_en_all_nopic/A/Wikipedia.html", the "wikipedia_en_all_nopic" part is decided by the kiwix-serve operator; it is not something that should be imposed by the ZIM publisher. The publisher cannot assume what the full absolute path could or should be, and so should not use absolute paths for internal URLs. URLs should therefore be relative, and I see only two options (continuing with my example): if you are in the same namespace, simply use "Wikipedia.html"; otherwise go back to the relative root of the file with "../A/Wikipedia.html".
Before filing a feature request for the Mediawiki:Collection extension and patching my own ZIM generation scripts, I think we should discuss this and take a decision (and afterwards also update the specs on the wiki). So I am waiting for your feedback.
Regards Emmanuel
Hello,
I know the problem. I had the same with the zimreader.
If you use relative URLs you have to take care with articles whose names contain '/'. Look at e.g. http://de.wikipedia.org/wiki/BMW_501/502. In a ZIM file this will be /A/BMW_501/502. From such an article, relative URLs must point to ../something or ../../A/something. If you have a plain relative link to e.g. BMW_505, it resolves to /A/BMW_501/BMW_505.
But yes, relative URLs are much better.
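As a rough illustration of the '/' problem, here is a small Python sketch that computes the relative link between two ZIM paths. The helper name relative_link, the host name and the "book" prefix are made up for the example, and URL escaping of special characters is left out.

```python
import posixpath
from urllib.parse import urljoin


def relative_link(from_path: str, to_path: str) -> str:
    """Relative URL from the article at from_path to the one at to_path.

    Both arguments are ZIM paths without a leading slash, e.g. "A/BMW_501/502".
    (Hypothetical helper; it does not percent-encode anything.)
    """
    base_dir = posixpath.dirname(from_path)
    # Anchor both paths at "/" so the result does not depend on the working directory.
    return posixpath.relpath("/" + to_path, start="/" + base_dir)


source = "A/BMW_501/502"
link = relative_link(source, "A/BMW_505")
print(link)  # ../BMW_505

# Check how a browser would resolve the links under an arbitrary server prefix.
page_url = "http://example.org/book/" + source
naive = urljoin(page_url, "BMW_505")   # plain name: resolves inside A/BMW_501/
correct = urljoin(page_url, link)      # depth-aware link: resolves to A/BMW_505
print(naive)    # http://example.org/book/A/BMW_501/BMW_505
print(correct)  # http://example.org/book/A/BMW_505
```

The number of "../" steps depends on how many '/' the source article name contains, so the writer has to compute it per article rather than use a fixed pattern.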
Tommi
On 20/05/2012 11:22, Tommi Mäkitalo wrote:
If you use relative URLs you have to take care with articles whose names contain '/'. Look at e.g. http://de.wikipedia.org/wiki/BMW_501/502. In a ZIM file this will be /A/BMW_501/502. From such an article, relative URLs must point to ../something or ../../A/something. If you have a plain relative link to e.g. BMW_505, it resolves to /A/BMW_501/BMW_505.
Good point, this is also something to take care of!
Emmanuel
On 20/05/2012 11:22, Tommi Mäkitalo wrote:
If you use relative URLs you have to take care with articles whose names contain '/'. Look at e.g. http://de.wikipedia.org/wiki/BMW_501/502. In a ZIM file this will be /A/BMW_501/502. From such an article, relative URLs must point to ../something or ../../A/something. If you have a plain relative link to e.g. BMW_505, it resolves to /A/BMW_501/BMW_505.
But yes, relative urls are much better.
It seems to me that we agreed that it is not possible to integrate ZIM contents into an HTTP sub-hierarchy (like http://domain.com/foo/bar/my_zim_content) if ZIM URLs are absolute. This is a problem, at least for creating online libraries based on ZIM.
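The effect can be checked with standard URL resolution, for example with this short Python sketch. The article name Kiwix.html and the image path I/logo.png are only placeholders for the example.

```python
from urllib.parse import urljoin

# A ZIM article served below a sub-hierarchy chosen by the server operator.
page = "http://domain.com/foo/bar/my_zim_content/A/Wikipedia.html"

# An absolute internal URL is resolved against the server root and escapes the prefix.
absolute = urljoin(page, "/A/Kiwix.html")

# Relative internal URLs stay below the prefix, whatever it is.
same_namespace = urljoin(page, "Kiwix.html")
other_namespace = urljoin(page, "../I/logo.png")

print(absolute)         # http://domain.com/A/Kiwix.html
print(same_namespace)   # http://domain.com/foo/bar/my_zim_content/A/Kiwix.html
print(other_namespace)  # http://domain.com/foo/bar/my_zim_content/I/logo.png
```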
I have consequently revamped the paragraph about URLs in the ZIM format specification: https://www.openzim.org/index.php?title=ZIM_File_Format&diff=1418&ol...
The big difference is that *before* we were advising absolute URLs (at least in the example) and *now* we forbid them (relative URLs are mandatory).
For now, the ZIM files produced by Kiwix and Wikimedia are "wrong". The impact is not dramatic, because Kiwix implements a workaround (on-the-fly URL rewriting). But adopting this modification will lead to at least two bug reports, one for Kiwix and one for Wikimedia.
Remarks? Objections?
Emmanuel
On 05/08/2012 23:52, Emmanuel Engelhart wrote:
On 20/05/2012 11:22, Tommi Mäkitalo wrote:
If you use relative URLs you have to take care with articles whose names contain '/'. Look at e.g. http://de.wikipedia.org/wiki/BMW_501/502. In a ZIM file this will be /A/BMW_501/502. From such an article, relative URLs must point to ../something or ../../A/something. If you have a plain relative link to e.g. BMW_505, it resolves to /A/BMW_501/BMW_505.
But yes, relative urls are much better.
It seems to me that we agreed that it is not possible to integrate ZIM contents into an HTTP sub-hierarchy (like http://domain.com/foo/bar/my_zim_content) if ZIM URLs are absolute. This is a problem, at least for creating online libraries based on ZIM.
I have consequently revamped the paragraph about URLs in the ZIM format specification: https://www.openzim.org/index.php?title=ZIM_File_Format&diff=1418&ol...
The big difference is that *before* we were advising absolute URLs (at least in the example) and *now* we forbid them (relative URLs are mandatory).
For now, the ZIM files produced by Kiwix and Wikimedia are "wrong". The impact is not dramatic, because Kiwix implements a workaround (on-the-fly URL rewriting). But adopting this modification will lead to at least two bug reports, one for Kiwix and one for Wikimedia.
Remarks? Objections?
I have opened two bugs:
* on the Wikimedia side for the extension:collection https://bugzilla.wikimedia.org/show_bug.cgi?id=39651
* on the Kiwix side: https://sourceforge.net/tracker/?func=detail&aid=3561737&group_id=17...
I hope this will be fixed soon, as URL rewriting generates a big overhead for kiwix-serve. As the format has changed a little, it might also be good to increment the minor version; otherwise it will be difficult to know whether the internal URLs of a given file are relative or absolute.
Regards Emmanuel