Hello!
What script would you recommend for creating a static offline version of a MediaWiki? (Perhaps with and without Parsoid?)
I've been looking for a good solution for ages, and have experimented with a few things. Here's what we currently do. It's not perfect, and really a bit too cumbersome, but it works as a proof of concept.
To illustrate, one of our wiki pages is here: http://orbit.educ.cam.ac.uk/wiki/OER4Schools/What_is_interactive_teaching
We have a "mirror" script, that uses the API to generate an HTML version of a wiki page (which is then 'wrapped' in a basic menu):
http://orbit.educ.cam.ac.uk/orbit_mirror/index.php?page=OER4Schools/What_is_...
(Some log info is printed at the bottom of the page, which provides some hints as to what is going on.)
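(In essence the script just asks the API for the rendered HTML of a page - roughly the equivalent of the call below. The actual mirror script adds caching and URL rewriting on top, and the API path shown is just the usual /w/api.php default, so treat this only as a sketch:)

  # fetch the rendered HTML of a single page via the standard parse API
  curl 'http://orbit.educ.cam.ac.uk/w/api.php?action=parse&page=OER4Schools/What_is_interactive_teaching&prop=text&format=json'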
The resulting page is as low-bandwidth as possible (which is one of our use cases). The original idea with the mirror PHP script was that you could run it on your own server: it only requests pages if they have changed, and keeps a cache, which allows viewing pages if your server has no connectivity. (You could of course use a cache anyway; there are advantages and disadvantages compared to this more explicit caching method.) The script rewrites URLs so that normal page links stay within the mirror, but links for editing and history point back at the wiki (see the tabs along the top of the page).
The mirror script also produces (and caches) a static web page; see here: http://orbit.educ.cam.ac.uk/orbit_mirror/site/OER4Schools%252FHow_to_run_wor...
Assuming you've run wget across the mirror, the site will be completely mirrored in '/site'. You can then tar up '/site' and distribute it alongside your w/images directory, and you have a static copy; or use rsync to incrementally update '/site' and w/images on another server.
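(For the record, the whole round trip is roughly the following; the commands and flags are standard, but the exact paths are only illustrative:)

  # visit every page of the mirror so the script (re)generates its static copies;
  # the locally downloaded files aren't needed, hence --delete-after
  wget -r -np --delete-after http://orbit.educ.cam.ac.uk/orbit_mirror/
  # then, on the server: bundle the static site together with the images
  tar czf offline-site.tar.gz orbit_mirror/site w/images
  # ...or push incremental updates to another machine instead
  rsync -avz orbit_mirror/site w/images otherserver:/var/www/offline/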
There's also an API-based process that can work out which pages have changed and refreshes the mirror accordingly.
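(That part is essentially a query against the standard recentchanges list - conceptually something like the call below, with the timestamp and API path only illustrative; the real script does a bit more bookkeeping:)

  # ask the API which pages have changed since the last run
  curl 'http://orbit.educ.cam.ac.uk/w/api.php?action=query&list=recentchanges&rcprop=title|timestamp&rcend=2013-11-01T00:00:00Z&rclimit=500&format=json'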
Most of what I am using is in the MediaWiki software already (i.e. API->HTML), and it would be great to have a solution like this that could generate an offline site on the fly. Perhaps one could add another export format to the API, and then an extension could generate the offline site and keep it up to date as pages on the main wiki change. Does this make sense? Would anybody be up for collaborating on implementing this? Are there better things in the pipeline?
I can see why you perhaps wouldn't want it for one of the major Wikimedia sites, or why it might be inefficient somehow. But for our use cases - a small-ish wiki with a set of poorly connected users across the digital divide - it would be fantastic.
So - what are your solutions for creating a static offline copy of a MediaWiki?
Looking forward to hearing about it! Bjoern
Is a ZIM file acceptable as well?
Hi Rupert,
Yes, ZIM is definitely one possibility, and something we would like to explore. We would like to be able to provide our resource on a memory stick, and ZIM could work well for that.
There are two potential drawbacks:
(1) ZIM requires the reader software to read the file, so in some circumstances a plain HTML version might be the best way.
(2) Emmanuel mentions that incremental ZIM updates are on the roadmap. For us, that's a very important feature, because we are dealing with low-bandwidth, high-cost connections. So we have to be able to create incremental updates.
So for now, we would probably be best off with ZIM as well as plain HTML.
Does the ZIM process create a usable stand-alone HTML version first? That would be interesting.
Emmanuel has offered to create a ZIM file for us, and I am checking with our computing service at the moment whether we can run npm and nodejs on our server.
Bjoern
Where are you located, if I may ask? Would sending a USB key via snail mail be a viable update option?
Rupert
Hi Rupert,
Our projects are run by the Centre for Commonwealth Education (http://www.educ.cam.ac.uk/centres/cce/), at the University of Cambridge in the UK. Our project work is mainly in sub-Saharan Africa, but also in the Caribbean and Asia.
Sending a memory stick is possible in some circumstances, but it's not really a scalable solution. A lot of the time an optimised, resumable download is the best option. There is some connectivity, but usually even a 'broadband' connection doesn't deliver more than intermittent modem speeds. However, with the right infrastructure, that's enough to transfer files of the order of 50-100MB overnight.
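(Concretely, that just means using a resumable transfer rather than a plain download - e.g. something along these lines, with the filenames of course only illustrative:)

  # resume a partially downloaded archive instead of starting over
  wget -c http://orbit.educ.cam.ac.uk/downloads/offline-site.tar.gz
  # or let rsync keep and reuse partial transfers, with compression
  rsync -avz --partial otherserver:/var/www/offline-site.tar.gz .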
While that's slow, sending a memory stick is also slow, and it takes a lot of effort to coordinate. For instance, in our work in Zambia, we tend to send stuff out with people travelling, and hand it over e.g. to a local NGO, who then sends it out with somebody already travelling (by bus or car).
All the best, Bjoern
Hi all,
I'd like to ask a follow-up question to this. What do you guys use for converting HTML to PDF (on the command line)? Let's say once Parsoid has generated HTML5, how do you get it into PDF? I've used (and struggled) a bit with wkpdf and enscript to generate page numbers - is that the best solution around?
Many thanks! Bjoern
On 19/11/2013 19:13, Bjoern Hassler wrote:
I'd like to ask a follow-up question to this. What do you guys use for converting html to pdf (on the command line)? Let's say once Parsoid has generated html5, how do you get it into pdf? I've used (and struggled) a bit with wkpdf and enscript to generate page numbers - is that the best solution around?
phantomjs PDF rasteriser https://coderwall.com/p/5vmo1g
Hi Emmanuel,
Thanks - I don't think that will work for us. Because it creates a raster image, the resulting PDF files are huge, and the text is not selectable.
Is it possible to use phantomjs to produce a non-raster PDF?
I've so far tried wkhtmltopdf, but the rendering wasn't great, and wkpdf, which produced very nice PDFs, but unfortunately some lines are cut horizontally across two pages. This may be to do with WebKit on my OS X install (10.7), but I couldn't test on 10.9 because wkpdf doesn't work with Ruby 2.0. Does anybody have any thoughts on this? (I realise this is perhaps a little off topic, but I guess HTML5->PDF needs to be solved for MediaWiki as well.)
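For what it's worth, with wkhtmltopdf the kind of invocation I'd expect to need is roughly the following (untested beyond the basics, and I believe the footer options require the patched-Qt build):

  # A4 output with simple page numbers in the footer
  wkhtmltopdf --page-size A4 --footer-center '[page]/[topage]' input.html output.pdf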
Any other thoughts? Bjoern
From the link in the comments of Emmanuel's link:
http://we-love-php.blogspot.com/2012/12/create-pdf-invoices-with-html5-and-p...
Hi Renaud,
Yes, that's right, but the PDF generated in that way appears to be rasterised, so the files are very large, not searchable, etc.
Perhaps it's possible with phantomjs? Does anybody know?
Bjoern
Have you actually tried it? I haven't, but the samples to download in the article are selectable, searchable and all. It uses an old version of phantomjs though…
Yes, I have tried it, and got a very large, unpaginated image of it.
Would you possibly be able to send me the URL for the page you tried (off list), as well as the output? (Perhaps plus the version of phantomjs, the OS, and the JS script or a link to it?)
That would be very helpful. I'll try to generate it for the same URL and see how the output differs.
Thanks! Bjoern
I have tried it myself on Ubuntu 13.10 with the phantomjs package: $ phantomjs /usr/share/doc/phantomjs/examples/rasterize.js 'http://en.wikipedia.org/' enwiki.pdf
The output is available here: http://tmp.kiwix.org/enwiki.pdf
As you can see, the text can be selected, so this is not a big image.
But, at the same time, there is an OS X specific bug, which may explain your result: https://github.com/ariya/phantomjs/issues/10373
Emmanuel
Hi Emmanuel,
That's extremely helpful! I will try this on Linux!
Incidentally, we managed to resolve some of the issues we had with wkpdf by specifying explicit styles, such as width/height/font-size. The problems we were having were probably to do with the Quartz rendering framework rather than with wkpdf directly!
Thanks so much! Bjoern
Hi Emmanuel,
I've tried this out on a Raspberry Pi, and indeed the output is a proper PDF.
Do you have any thoughts as to why the paper size comes out as 10cm by 77cm?
Thanks, Bjoern
On 22/11/2013 14:51, Bjoern Hassler wrote:
I've tried this out on Raspberry Pi, and indeed the output is a proper pdf.
Nice.
Do you have any thoughts as to why the paper size comes out as 10cm by 77cm?
I don't know phantomjs in detail, but if you have a look at the code, you can see that the output size seems to be customisable: https://github.com/ariya/phantomjs/blob/master/examples/rasterize.js
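If I read the example correctly, rasterize.js also accepts an optional paper size as a third command line argument, so something like this (untested) should already give you A4 pages:

  phantomjs /usr/share/doc/phantomjs/examples/rasterize.js 'http://en.wikipedia.org/' enwiki-a4.pdf A4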
Emmanuel
OK, thanks - that could have occurred to me! Thanks for bearing with me!
Bjoern