I was curious if you include images? If not, are you considering doing so, and what's stopping you? If so, how do you pick them?
We haven't done this on our "Test DVD" this summer, even though it is easily possible. Well, "easily" means it is easy for the format; the problem is choosing the images. Emmanuel Engelhart (Kiwix; he is also part of the openZIM team) has written some Perl scripts for that.
We didn't do it for two reasons: Lack of time, because even though the tools exist, it's a lot of work: searching through the articles, collecting all the image URLs, fetching the images, deciding what size to resize them to, etc.
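To give a feel for the steps just listed, here is a minimal Python sketch, assuming the articles are already available as rendered HTML. It is not the Perl tooling mentioned above; the names `ImageCollector`, `image_urls`, `fit_within` and the `max_side` parameter are made up for illustration. It only collects image URLs and computes a target size, and does not download or resize anything.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag in one rendered article."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative image paths
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <img ... /> tags.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(urljoin(self.base_url, src))


def image_urls(article_html, base_url):
    """Return absolute image URLs in document order (hypothetical helper)."""
    parser = ImageCollector(base_url)
    parser.feed(article_html)
    return parser.urls


def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never enlarge small images
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Picking `max_side` is exactly the "which size" decision mentioned above; the sketch leaves that choice to the caller.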
I could help with generating the list of images to pick; I've done that work before for the OLPC activity. I use traffic stats. Traffic stats (when used appropriately) work quite well for picking which articles or images to include.
Ben Schwartz can help too. I believe he was responsible for automatically acquiring and resizing images (and even converting SVG to JPG). He's the other major contributor to the OLPC activity who still has an interest in the general goal of offline Wikipedias.
And the openZIM project is not a publisher of offline content. We are developing a stable, efficient format that allows free interchange of content between reader applications and devices, and we are providing a GPL'ed sample implementation of it.
So should I be talking to someone else? Who should I talk to?
Well, we had an "Offline Meeting" at Wikimania in Buenos Aires this summer, in which Samuel Klein also participated. Our goal is to contribute the right technology so that all the offline projects are able to collaborate. Currently everyone is reinventing the wheel when it comes to storage of the content.
Unfortunately, SJ had very little to do with the actual program, which ended up being created by volunteers not on the wikibrowse mailing list.
We think that the specific knowledge of the publishers should be how to select the content (which content goes where, and in which form) and not technical questions such as compression, storage, or retrieval of the data on the user's end.
OK, if I shouldn't be talking to you guys, tell me who to talk to.
Yes, selecting content is very difficult. I couldn't get Peru or SJ to contribute meaningfully to generating a simple blacklist of articles that should NOT be included on the OLPC activity. (Recall it is being given to young children!) I ended up making the blacklist myself based on my own gut feelings. If Peru's board of education or OLPC's "director of content" couldn't get their act together for this simple task, expecting others to do this task for you will be a huge roadblock to getting content out.
Traffic-based content selection is simple and effective, and it doesn't involve a lot of opinions on what should or should not be included.
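Roughly, the idea can be sketched like this in Python. This is only an illustration, not the actual code used for the OLPC activity; `pick_articles`, `view_counts`, `blacklist` and `limit` are hypothetical names, and the view counts are assumed to have been tallied from traffic logs beforehand.

```python
def pick_articles(view_counts, blacklist, limit):
    """Return up to `limit` titles, most viewed first, skipping the blacklist.

    view_counts: dict mapping article title -> number of page views
    blacklist:   set of titles that must never be included
    """
    allowed = {t: v for t, v in view_counts.items() if t not in blacklist}
    # Sort by views, highest first; break ties by title so the result is stable.
    ranked = sorted(allowed, key=lambda t: (-allowed[t], t))
    return ranked[:limit]
```

The same ranking could be applied to images by counting views of the articles that embed them, though that is an assumption on top of the sketch.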
- Madeleine