How important is ZIM support in Collections?

Erik Moeller

12 Nov 2013

9:51 p.m.

Hi offline community,

how important is ZIM support in Collections (the "Create a book" feature) on Wikimedia sites? We implemented this a while ago to support offline efforts. Since collections are still typically very much limited in size, it's not a very viable option for huge offline exports, more for batches of articles on related topics. Do people currently rely on this functionality for offline deployments?

We're re-implementing the rendering pipeline for Collections to ensure long-term maintainability, and our default would be to eliminate initially all formats except for PDF if we don't absolutely have to support them. I'll see if we can get some metrics on current ZIM file usage via the Collection extension, but it'd be nice to get qualitative feedback as well.

(More background at: https://www.mediawiki.org/wiki/PDF_rendering )

Thanks, Erik

-- Erik Möller VP of Engineering and Product Development, Wikimedia Foundation


Federico Leva (Nemo)

13 Nov 2013

1:59 a.m.

Erik Moeller, 13/11/2013 06:51:

...

We're re-implementing the rendering pipeline for Collections to ensure long-term maintainability, and our default would be to eliminate initially all formats except for PDF if we don't absolutely have to support them. I'll see if we can get some metrics on current ZIM file usage via the Collection extension, but it'd be nice to get qualitative feedback as well.

(More background at: https://www.mediawiki.org/wiki/PDF_rendering )

All including ePub (cc Wikisource-l).

Nemo

Emmanuel Engelhart

1:59 a.m.

Hi Erik

It is great to see you raising this now.

On 13/11/2013 06:51, Erik Moeller wrote:

...

how important is ZIM support in Collections (the "Create a book" feature) on Wikimedia sites? We implemented this a while ago to support offline efforts. Since collections are still typically very much limited in size, it's not a very viable option for huge offline exports, more for batches of articles on related topics. Do people currently rely on this functionality for offline deployments?

Kiwix, as a project, does not rely directly on the WM ZIM export, but many of our users do. Of course they suffer from the limitations of the current solution and, frankly, most of them are not aware of this feature.

So dropping it would be a setback for us. But I agree something should be done. IMO we should try to support three important output formats:

* PDF (suited to really small collections)
* EPUB (the most widely used free ebook format)
* ZIM (for bigger collections)

...

We're re-implementing the rendering pipeline for Collections to ensure long-term maintainability, and our default would be to eliminate initially all formats except for PDF if we don't absolutely have to support them. I'll see if we can get some metrics on current ZIM file usage via the Collection extension, but it'd be nice to get qualitative feedback as well.

(More background at: https://www.mediawiki.org/wiki/PDF_rendering )

I have also seen your email to wikitech-l: http://lists.wikimedia.org/pipermail/wikitech-l/2013-November/073059.html

I think the choice of Parsoid as a rendering backend is a really good one for PDF. I have always advocated the HTML-to-PDF approach. I also think that Parsoid delivers the information needed to rewrite the HTML correctly and adapt it for offline use.

That's why I have been working since March on a solution called mwoffliner (also using nodejs, like Parsoid): https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/mwoffliner.j...

Mwoffliner:

1 - Download a selection of articles from the Parsoid API
2 - Rewrite the HTML code
3 - Write the ZIM file (not yet implemented, files are written on the filesystem)
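The three mwoffliner steps can be sketched roughly as follows. This is an illustration in Python, not mwoffliner's own code (which is Node.js). `fetch_html` is a hypothetical callable standing in for the Parsoid API request, the rewrite step is a placeholder that only strips `<script>` elements, and the output goes to the filesystem rather than into a ZIM file, matching the current state described above.

```python
import re
from pathlib import Path
from typing import Callable, Iterable

# Placeholder rewrite rule (an assumption): scripts are of no use offline.
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)


def rewrite_html(html: str) -> str:
    """Step 2: adapt the HTML for offline use (here: drop <script> elements)."""
    return SCRIPT_RE.sub("", html)


def safe_filename(title: str) -> str:
    """Map an article title to a filesystem-friendly file name."""
    return re.sub(r"[^\w.\-]", "_", title) + ".html"


def export_articles(
    titles: Iterable[str],
    fetch_html: Callable[[str], str],  # hypothetical stand-in for the Parsoid API call
    out_dir: Path,
) -> list[Path]:
    """Run steps 1 to 3 for each title and return the files written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for title in titles:
        html = fetch_html(title)          # step 1: download
        html = rewrite_html(html)         # step 2: rewrite
        path = out_dir / safe_filename(title)
        path.write_text(html, encoding="utf-8")  # step 3: write (files, not ZIM)
        written.append(path)
    return written
```

Passing the fetcher in as a parameter keeps the sketch runnable without network access; a real run would supply a function that requests the article HTML from a Parsoid instance.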

You can get an idea of the rendering from this complete Russian Wikipedia (WPRU) collection (a ZIM file served with kiwix-serve): http://library.kiwix.org/wikipedia_ru_all

If I understand correctly, points 1 and 2 are similar to what you plan to do for the new PDF pipeline. So it would be great to collaborate on this and maintain the ZIM output. How does that sound?

Emmanuel

--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication

Bjoern Hassler

3:42 a.m.

Hi Erik, Federico, Emmanuel, hi all,

I just wanted to give some input on this. We are using MediaWiki to provide teacher education resources for teachers in sub-Saharan Africa; see http://www.oer4schools.org. (The big picture is the 2nd MDG of achieving Universal Primary Education.) We chose MediaWiki because it is an open platform with a lot of momentum behind it, and because we can produce PDF and offline versions, which is essential for us. The new visual editor is also an excellent development.

Here are some of the things that are important to us. (Seeing as this is a longer email, I've put this onto my blog as well, see http://bjohas.de/Blog, with a bit more formatting, and some additional notes.)

= Issues with current PDF generation =

Seeing as this thread is about pdf, I'll start with some issues around the pdf:

* We are essentially writing educational materials, and would like a way of putting text in boxes to flag the nature of that text (e.g. as a transcript, background reading, a note meant for facilitators). We have implemented this quite straightforwardly using a div with a border or a different background colour. However, because the current PDF rendering uses the wikitext (rather than HTML), all this formatting is lost. See http://www.oer4schools.org for examples of how we use boxes.

* Numbered section headings. Sections in the PDF aren't numbered, which isn't helpful (the magic word NUMBEREDHEADINGS is ignored). This may not be a problem for Wikipedia articles, but it is when writing materials for teacher education, where you need to be able to refer to a section by its number (e.g. during workshops).

* We also make extensive use of the semantic mediawiki extension, e.g. to assign episodes to our videos. Again, this isn't implemented in the current pdf rendering pipeline.

I am not fully up to speed with what the plans are, but if the proposal is html->pdf rendering, rather than wiki text -> pdf rendering, then the above issues would be solved anyway.

= Our use cases =

More widely: What are our use cases? Our OER4Schools resource is used by teachers in Zambia for professional development, with very limited connectivity. The following scenarios are critical in this work (and would be similarly critical for most teacher education scenarios in sub-Saharan Africa):

'''Scenario 1: Pdf / print.''' We need to be able to print our whole professional development programme (around 200 pages). At the moment, we print each wiki page needed to pdf, and then collate them. It's not a great process. We can't use the collection extension because of the above issues.

'''Scenario 2: Use on local web server.''' We would like to be able to produce a static stand-alone version of the wiki (in html) that can run off a local web server. It would be good if links to any non-static content pointed back at the live version (e.g. links to other namespaces, such as 'Special', as well as 'edit'/history links). Ideally, the same (or a similar) version could run off a memory stick for use on netbooks. We have tinkered with some scripts, and there are other scripts out there: We'd love some help in finding something robust.
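One way to approach the link requirement in Scenario 2 is to classify each link while generating the static copy. The following is a minimal sketch, assuming MediaWiki's common `/wiki/Title` and `/index.php?...` URL layout. `LIVE_BASE` is a placeholder address and the list of namespaces treated as non-static is an illustrative choice; neither is taken from an existing export script.

```python
import re

LIVE_BASE = "https://wiki.example.org"  # placeholder for the live wiki (assumption)
# Namespaces whose pages are not part of the static copy (illustrative choice).
NON_STATIC_PREFIXES = ("Special:", "User:", "Talk:")

HREF_RE = re.compile(r'href="([^"]*)"')


def rewrite_href(href: str) -> str:
    """Return a local file link for static pages, a live-site link otherwise."""
    if href.startswith("/index.php"):
        # Edit, history and similar action links only make sense on the live wiki.
        return LIVE_BASE + href
    if href.startswith("/wiki/"):
        title = href[len("/wiki/"):]
        if title.startswith(NON_STATIC_PREFIXES):
            return LIVE_BASE + href
        # Keep any "#section" fragment after the local file name.
        page, sep, fragment = title.partition("#")
        return f"{page}.html{sep}{fragment}"
    return href  # external links and anything unrecognised stay untouched


def rewrite_links(html: str) -> str:
    """Apply rewrite_href to every double-quoted href attribute in the HTML."""
    return HREF_RE.sub(lambda m: f'href="{rewrite_href(m.group(1))}"', html)
```

A regular expression is enough for a sketch; a robust tool would more likely use an HTML parser so that unusual attribute quoting is handled too.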

'''Scenario 3: Use on tablets / phones.''' We would love to have a version for mobile phones and tablets. Tablets are overtaking netbooks at the moment, and are starting to become available cheaply. This comes in two versions:

* '''Offline access:''' We'd love some advice on how we can achieve this with ZIM. One issue is that we would want to update our resource, and it would be good if that didn't mean downloading the whole resource again. The biggest items are uploads (files, images, audio, video). It would be OK for the wiki text to be re-downloaded, but it would not be feasible for us to re-download uploads.

* '''Online access:''' We'd love some advice on how to adapt the Wikipedia apps to work with our wiki, to give efficient access.

A little further off topic: We would also like to implement the mediawiki mobile rendering (as m.orbit.educ.cam.ac.uk). If somebody wanted to help us with this, we would really appreciate it.

I'd certainly be happy to engage in the discussion, and help / test new ideas in our context!

All the best, Bjoern

Offline-l mailing list Offline-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/offline-l

Emmanuel Engelhart

4:28 a.m.

Dear Bjoern

First of all, thank you very much for your email. It shows that this topic concerns not only our Wikimedia projects but also many other MediaWiki users.

I will try to answer the ZIM-related part of your message.

On 13/11/2013 12:42, Bjoern Hassler wrote:

...

'''Scenario 2: Use on local web server.''' We would like to be able to produce a static stand-alone version of the wiki (in html) that can run off a local web server. It would be good if links to any non-static content pointed back at the live version (e.g. links to other namespaces, such as 'Special', as well as 'edit'/history links). Ideally, the same (or a similar) version could run off a memory stick for use on netbooks. We have tinkered with some scripts, and there are other scripts out there: We'd love some help in finding something robust.

Kiwix-serve is able to serve any ZIM file: http://www.kiwix.org/wiki/Kiwix-serve

...

'''Scenario 3: Use on tablets / phones.''' We would love to have a version for mobile phones and tablets. Tablets are overtaking netbooks at the moment, and are starting to become available cheaply. This comes in two versions:

We have Kiwix for Android which is able to open any ZIM file: https://play.google.com/store/apps/details?id=org.kiwix.kiwixmobile

We want to develop a version for iOS, but for now there is no concrete schedule: http://www.kiwix.org/wiki/IOS

...

'''Offline access:''' We'd love to have some advice how we can achieve this with ZIM. I guess one issue is that we would want to update our resource, and it would be good if that didn't mean that the whole resource needs to be downloaded again. The biggest items are uploads (files, images, audio, video). I think it would be ok for the wiki text to be re-downloaded, but it would not be feasible for us to re-download uploads.

We have a solution for incremental ZIM updates that is still in development but already working. It should be available to users in a few months. But, as far as I can see, your MediaWiki is not too big, so the ZIM file shouldn't be too big either.
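The general idea behind such an update can be illustrated with a content-hash manifest. This is not the Kiwix mechanism, whose details are not described in this thread; it is a sketch, with hypothetical function names, of how a client could work out which files to fetch so that unchanged uploads are never downloaded again.

```python
import hashlib


def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file name to the SHA-256 digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def plan_update(old: dict[str, str], new: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare two manifests and return (files to download, files to delete)."""
    # A file is fetched if it is new or if its digest has changed.
    to_download = sorted(name for name, digest in new.items() if old.get(name) != digest)
    # A file is deleted if it no longer exists in the new version.
    to_delete = sorted(name for name in old if name not in new)
    return to_download, to_delete
```

With this scheme a large video that has not changed keeps the same digest and is skipped, while an edited page is fetched again.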

The real problem is the ZIM file generation. The future solution based on Parsoid should allow you (and anyone else, if your wiki is public) to easily build a ZIM file of it. For now we need to fix things on both the Parsoid and the Kiwix side before we have a perfectly usable solution.

But if you manage to get a dev instance of Parsoid working on your wiki, I would be happy to try to build a ZIM file for you: https://www.mediawiki.org/wiki/Parsoid

Emmanuel


Erik Moeller

11:49 p.m.

On Wed, Nov 13, 2013 at 3:42 AM, Bjoern Hassler bjohas+mw@gmail.com wrote:

...

We also make extensive use of the semantic mediawiki extension, e.g. to assign episodes to our videos. Again, this isn't implemented in the current pdf rendering pipeline.

I'm guessing this could be an issue with Parsoid, depending on where the extra SMW markup is used. Is the use limited to templates?


Bjoern Hassler

14 Nov 2013

12:35 a.m.

Hi Erik,

...

...
to assign episodes to our videos. Again, this isn't implemented in the current pdf rendering pipeline.

I'm guessing this could be an issue with Parsoid, depending on where the extra SMW markup is used. Is the use limited to templates?

Sorry, a misunderstanding. By "current pdf rendering pipeline" I meant the current Pediapress pipeline.

But yes, SMW markup is limited to templates. However, we're not using parsoid. I am checking whether we can get it installed to try out ZIM packaging.

Bjoern

Erik Moeller

13 Nov 2013

11:44 p.m.

On Wed, Nov 13, 2013 at 1:59 AM, Emmanuel Engelhart kelson@kiwix.org wrote:

...

That's why I have been working since March on a solution called mwoffliner (also using nodejs, like Parsoid): https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/mwoffliner.j...

Mwoffliner: 1 - Download a selection of articles from the Parsoid API 2 - Rewrite the HTML code 3 - Write the ZIM file (not yet implemented, files are written on the filesystem)

Very cool! It may very well be possible to integrate this with the rendering pipeline in the first iteration, at least as a stretch goal. CCing Matt & Scott though I suspect they're already aware.

NB - we did run some numbers, and we're currently getting at most ~100 ZIM downloads/day from collections, across all wikis combined.

Erik


rupert THURNER

17 Nov 2013

10:58 a.m.

On Thu, Nov 14, 2013 at 8:44 AM, Erik Moeller erik@wikimedia.org wrote:

...

NB - we did run some numbers, and we're currently getting at most ~100 ZIM downloads/day from collections, across all wikis combined.

i am surprised at myself now. i have known openzim for a long time, i even tried the collections extension once in the beginning and found the user interface cruel. i installed kiwix on a pc - where i did not need it. i tried to install kiwix on android, but the android version was too old to run it. then i hijacked a phone to try it there, and the zim file i wanted to use was so big that the fat32 filesystem could not store it. i wanted to take contents abroad because the mobile phone fees are just too expensive and it's a hassle to always chase a wifi hotspot, but i did not go back to the collections extension. so - no downloads from me, even though i would need it. but - i was not able to connect the dots until your mail, erik. thank you so much for it!

so i know now that i want pdf to print, and openzim to take away. i am wondering how i get, with this extension, the articles about london from wikipedia and wikivoyage into one book / zim file?

rupert.

Emmanuel Engelhart

19 Nov 2013

5:11 a.m.

On 14/11/2013 08:44, Erik Moeller wrote:

...

NB - we did run some numbers, and we're currently getting at most ~100 ZIM downloads/day from collections, across all wikis combined.

Good to know. That is really not much. I don't know how it compares with other formats, but we can certainly do a lot better.

Emmanuel


Charles Andres

20 Nov 2013

6:29 a.m.

Hi Erik,

I would like to point out that ZIM support in the "Create a book" feature may become more and more important in the future.

In our newly started education program, we are heavily promoting the use of Kiwix and the creation of collections from Wikipedia with this feature, and I think it will be used more and more.

Sincerely

Charles

"Wikimedia CH" – Association for the advancement of free knowledge – www.wikimedia.ch Skype: charles.andres.wmch IRC://irc.freenode.net/wikimedia-ch

