Hi,
Currently the Wikipedia dumps are stored in a single zim file. Their size is already over 2GB for the English Wikipedia, and over 4GB for some versions with images included. Many devices don't support files of that size; typically their file size limit is 2GB or 4GB. The 2GB limit is due to the use of signed 32-bit types in file access and is unfortunately not that uncommon: for example Symbian2 (and earlier versions), iostream on Windows (see also http://bugs.openzim.org//show_bug.cgi?id=19), old Linux versions, and possibly Android [1] don't support files larger than 2GB. Other OSes (including Symbian3 or Maemo) do support them, but in many cases there is still a 4GB limit due to the FAT32 file system, which is the standard file system for SD cards and also for the internal memory of most mobile phones. Some of them, like Maemo or possibly Android, support other file systems which don't have this limit, but that requires reformatting the memory card and makes the card unreadable for many other devices. So a different solution would make sense for these cases as well.
The question is how to support devices which have the 2GB or 4GB limit. The following options come to my mind:

1. Split files on the file system level with a special naming convention (e.g. *.0.zim, *.1.zim, etc.). The zim format is unchanged; the zim library has to be extended to support this.
   Advantages: relatively simple change to zimlib (only replace the iostream implementation); the end user can split files relatively easily.
   Disadvantage: not a really clean solution.

2. Split into valid zim files with separate headers. Store in every zim file (e.g. in the metadata, "relation"?) the names of the related files.
   Advantages: clean solution; allows other features as well, e.g. separating images and text into separate files.
   Disadvantages: larger change to the zim file format; possibly a larger change to zimlib (or to the application, if handled in metadata).

3. Split into valid zim files with separate headers, with no changes to the zim file format or zimlib; the application using zimlib has to load all related files (and find out which ones are related) appropriately.
   Advantage: no change to zimlib.
   Disadvantages: the application has to handle this; difficult for the end user to split a file; no convention for how to detect related files (in the worst case the user has to open them all separately), which is problematic if split files are to be provided.

4. Other ideas?
For all options it is possible to directly provide the split files (i.e. in the future mediawiki would directly write out zim files within the 2GB limit), or to let the end user do the splitting. I'd definitely prefer if split files are provided.
What is your opinion on this? For WikiOnBoard I'd have to solve this soon (in particular if the next German wiki zim is larger than 2GB ;), and therefore I'd need to implement something on my own if there is no agreed solution. However, I'd strongly prefer that we agree on a common solution.
Best regards,
Christian

[1] http://osdir.com/ml/android-porting/2010-03/msg00107.html
Hi Christian,
zimlib already has its own iostream implementation, so there is no limitation there. But I see the problem with file systems, which may limit the file size.
I am willing to implement whatever solution we prefer. Larger changes to zimlib are not an issue and should not affect our decision. I want to have the best possible solution - not the easiest to implement.
The old zeno implementation, as well as early implementations of zimlib, had a feature to support multiple files. Instead of opening a single zim (or zeno) file, it was able to open all zim files in one directory.
The last German Wikipedia DVD had separate files for text and images, as well as separate DVDs for images in higher resolution. Just by copying the files into one directory it was possible to access all content from all zeno files.
I removed that feature for better portability; there is no standard, portable way to read all files in a directory.
Your suggestions 2 and 3 imply that the creator of the zim file needs to address the problem: he has to split the content. Of course the zimwriter can help by splitting the file automatically, but handling zim files would become more difficult. The user has to know which files belong together, and if he downloads zim files, he has to download multiple files. It also implies that we must extend the specification to limit the file size to 2GB. I don't like that; I don't want to impose a limit per spec.
Solution 1 really has the advantage that the user can download a single zim file and split it himself when needed. There is even a unix/linux tool to split files into pieces, named split (isn't it nice how intuitive unix really is ;-) ).
It is quite easy to extend the iostream to support multiple files, so that it internally joins the parts into one logical zim file. We just have to think about the interface - how to tell zimlib which files to join.
As you suggested, a naming convention is one possible solution. We may even use the scheme from split: if you split foo.zim into parts, the parts are named foo.zimaa, foo.zimab, foo.zimac and so on. If you tell zimlib to open the file foo.zim and it is not found, it looks for the parts until it does not find any more.
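A minimal sketch of how such part discovery and joined reading could look. This is hypothetical illustration code, not the actual zimlib implementation; the function names findParts and readAt are made up for this example:

```cpp
// Locate the parts of a (possibly split) zim file using the split(1)
// naming scheme foo.zimaa, foo.zimab, ... and read from them as if they
// were one contiguous file.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Return the files that make up the zim file. If "foo.zim" itself exists,
// it is used as-is; otherwise we probe "foo.zimaa", "foo.zimab", ... and
// stop at the first missing part.
std::vector<std::string> findParts(const std::string& name)
{
    std::vector<std::string> parts;
    if (std::ifstream(name.c_str()).good())
    {
        parts.push_back(name);
        return parts;
    }
    for (char c1 = 'a'; c1 <= 'z'; ++c1)
        for (char c2 = 'a'; c2 <= 'z'; ++c2)
        {
            std::string part = name + c1 + c2;
            if (!std::ifstream(part.c_str()).good())
                return parts;   // first gap ends the sequence
            parts.push_back(part);
        }
    return parts;
}

// Size of a single part file.
std::uint64_t partSize(const std::string& part)
{
    std::ifstream in(part.c_str(), std::ios::binary | std::ios::ate);
    return static_cast<std::uint64_t>(in.tellg());
}

// Read `count` bytes at logical offset `off` of the joined file,
// crossing part boundaries as needed. Returns true on success.
bool readAt(const std::vector<std::string>& parts,
            std::uint64_t off, char* buf, std::size_t count)
{
    for (std::size_t i = 0; i < parts.size() && count > 0; ++i)
    {
        std::uint64_t sz = partSize(parts[i]);
        if (off >= sz)          // requested range starts in a later part
        {
            off -= sz;
            continue;
        }
        std::ifstream in(parts[i].c_str(), std::ios::binary);
        in.seekg(static_cast<std::streamoff>(off));
        std::size_t n =
            static_cast<std::size_t>(std::min<std::uint64_t>(count, sz - off));
        in.read(buf, n);
        buf += n;
        count -= n;
        off = 0;                // subsequent parts are read from the start
    }
    return count == 0;
}
```

In zimlib this logic would naturally live inside the custom streambuf, so that the rest of the library keeps seeing one contiguous file.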
The user can split the files as needed and join them back using cat. Very easy.
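For illustration, the whole round trip with split and cat (the file names and the 1M chunk size are just for this demonstration; real parts would use a chunk size under the device limit, e.g. -b 2000M):

```shell
# Create a small dummy "zim" file, split it with the split(1) suffix
# scheme discussed above, and join the parts back with cat.
dd if=/dev/zero of=foo.zim bs=1024 count=2048 2>/dev/null
split -b 1M foo.zim foo.zim            # creates foo.zimaa and foo.zimab
cat foo.zimaa foo.zimab > rejoined.zim
cmp foo.zim rejoined.zim && echo "parts join back losslessly"
```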
Tommi