Hi folks,
for a long time we've relied on the mwlib libraries by PediaPress to generate PDFs on Wikimedia sites. These have served us well (we generate >200K PDFs/day), but they architecturally pre-date a lot of important developments in MediaWiki, and actually re-implement the MediaWiki parser (!) in Python. The occasion of moving the entire PDF service to a new data-center has given us reason to re-think the architecture and come up with a minimally viable alternative that we can support long term.
Most likely, we'll end up using Parsoid's HTML5 output, transform it to add required bits like licensing info and prettify it, and then render it to PDF via phantomjs, but we're still looking at various rendering options.
Thanks to Matt Walker, C. Scott Ananian, Max Semenik, Brad Jorsch and Jeff Green for joining the effort, and thanks to the PediaPress folks for giving background as needed. Ideally we'd like to continue to support printed book generation via PediaPress' web service, while completely replacing the rendering tech stack on the WMF side of things (still using the Collection extension to manage books). We may need to deprecate some output formats - more on that as we go.
We've got the collection-alt-renderer project set up on Labs (thanks Andrew) and can hopefully get a plan to our ops team soon for how the new setup could work.
If you want to peek, the work channel is #mediawiki-pdfhack on FreeNode.
Live notes here: http://etherpad.wikimedia.org/p/pdfhack
Stuff will be consolidated here: https://www.mediawiki.org/wiki/PDF_rendering
Some early experiments with different rendering strategies here: https://github.com/cscott/pdf-research
Some improvements to Collection extension underway: https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/...
More soon, Erik
Hi,
I'm taking this opportunity to bring up three bugs related to mwlib that deserve a broader discussion and whose fixes should perhaps be approached differently in the new version.
1. https://bugzilla.wikimedia.org/show_bug.cgi?id=56560 - PDF creation tool considers IPv6 addresses as users, not anonymous.
I've pushed a patch for this and it was merged; however, the detection is based on a regex and, as a quick Google search will tell you, writing a regex that covers all IPv6 cases is not straightforward. Perhaps the anonymous/logged-in information could be sent from MediaWiki instead (a regex-free check is sketched below).
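For illustration only, here is a minimal sketch of such a regex-free check, assuming Python's standard ipaddress module (stdlib in 3.3+, available as a backport for Python 2) -- this is not mwlib's actual code:

# Minimal sketch, not mwlib's actual code: treat any contributor name
# that parses as an IPv4 or IPv6 address as an anonymous user, without
# hand-writing an IP regex. Assumes the stdlib "ipaddress" module.
import ipaddress

def is_anonymous(username):
    """Return True if the username is a bare IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(username)
        return True
    except ValueError:
        return False

# is_anonymous("2001:db8::1")       -> True
# is_anonymous("192.0.2.7")         -> True
# is_anonymous("Vitalie_Ciubotaru") -> False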
2. https://bugzilla.wikimedia.org/show_bug.cgi?id=56219 - PDF creation tool excludes contributors with a "bot" substring in their username
I've also pushed a pull request for this one, but it was rejected based on the en.wp policy that forbids bot-like usernames for humans. The problem is more complex, though:
a. Should bots be credited for their edits? While most of them perform simple tasks, we have recently seen an increase in bot-created content; on ro.wp we even have a few lists edited only by robots.
b. If the robots should _not_ be credited, how do we detect them? Ideally there would be an automatic way to do so, but according to http://www.mediawiki.org/wiki/Bots it only works for recent changes. A less ideal option is to remove only users whose names end in "bot", in order to keep users like https://ro.wikipedia.org/wiki/Utilizator:Vitalie_Ciubotaru (who is not a robot, but has "bot" in the name) in the contributor list.
3. https://bugzilla.wikimedia.org/show_bug.cgi?id=2994 - Automatically generated count and list of contributors to an article (authorship tracking)
This is an old enhancement request, which I revived last month in a wikimedia-l thread: http://lists.wikimedia.org/pipermail/wikimedia-l/2013-October/128575.html . The idea is to decide if and how to credit:
a. vandals
b. reverters
c. contributors whose valid contributions were rephrased or replaced in the article
d. contributors with valid contributions but invalid names
I hope the people working on this feature will take the time to consider these issues and come up with solutions for them.
Thanks, Strainu
Note these are my own thoughts and not anything representative of the team.
On Wed, Nov 13, 2013 at 6:55 AM, Strainu strainu10@gmail.com wrote:
b. If the robots should _not_ be credited, how do we detect them? Ideally there would be an automatic way to do so, but according to http://www.mediawiki.org/wiki/Bots it only works for recent changes. A less ideal option is to remove only users whose names end in "bot", in order to keep users like https://ro.wikipedia.org/wiki/Utilizator:Vitalie_Ciubotaru (who is not a robot, but has "bot" in the name) in the contributor list.
Another way to exclude (most) bots would be to skip any user with the "bot" user right. Note, though, that this would still include edits by unflagged bots, or by bots that have since been decommissioned and had the bot flag removed.
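For illustration, a rough sketch of what that check could look like against the standard MediaWiki API (list=users with usprop=rights); the wiki URL and the surrounding code are assumptions for the example, not the renderer's actual implementation, and batching limits are ignored:

# Rough sketch, not the renderer's actual code: ask the MediaWiki API
# which of the listed contributors carry the "bot" user right and
# filter them out. Assumes the "requests" library is available.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # example wiki

def bot_users(usernames):
    """Return the subset of usernames that currently have the 'bot' right."""
    resp = requests.get(API_URL, params={
        "action": "query",
        "list": "users",
        "ususers": "|".join(usernames),
        "usprop": "rights",
        "format": "json",
    })
    resp.raise_for_status()
    data = resp.json()
    return {u["name"] for u in data.get("query", {}).get("users", [])
            if "bot" in u.get("rights", [])}

# credited = [u for u in contributors if u not in bot_users(contributors)]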
Personally, though, I do agree that excluding any user with "bot" in the name (or even with a name ending in "bot") is a bad idea even if just applied to enwiki, and worse when applied to other wikis that may have different naming conventions.
The idea is to decide if and how to credit: a. vandals, b. reverters, c. contributors whose valid contributions were rephrased or replaced in the article, d. contributors with valid contributions but invalid names.
The hard part there is detecting these, particularly case (c). And even then, the article may still be based on the original work in a copyright sense even if no single word of the original edit remains.
Then there's also the situation where A makes an edit that is partially useful and partially bad, B reverts, and then C comes along and incorporates parts of A's edit.
Thanks Brad,
I'm wondering if it wouldn't make sense to have a dedicated bugday at the end of the sprint?
Strainu
Let's see what sorts of bugs crop up. In my (limited) experience, the most common issues are probably article content that renders poorly as a PDF for some reason. Those bugs aren't easy to fix in a bug day sprint, since they tend to accumulate slowly over time as people use the service and collect lists of suboptimal pages. (And some of these issues might eventually be traced to Parsoid, and we know from experience that fixing those ends up being a gradual collaboration between authors and developers to determine whether the wikitext should be rewritten or the parser extended, etc.)
On the other hand, if our servers are crashing or the UI code is buggy, etc, then a bug day would probably be useful to squash those sorts of things. --scott
On Wed, Nov 13, 2013 at 12:45 AM, Erik Moeller erik@wikimedia.org wrote:
Most likely, we'll end up using Parsoid's HTML5 output, transform it to add required bits like licensing info and prettify it, and then render it to PDF via phantomjs, but we're still looking at various rendering options.
I don't have anything against this, but what's the reasoning? You now have to parse the wikitext into HTML5 and then parse the HTML5 into PDF. I'm guessing you've found some library that automatically "prints" HTML5, which would make sense since browsers do that already, but I'm just curious.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2016, Major in Computer Science
On 13/11/2013 17:10, Tyler Romeo wrote:
I don't have anything against this, but what's the reasoning? You now have to parse the wikitext into HTML5 and then parse the HTML5 into PDF. I'm guessing you've found some library that automatically "prints" HTML5, which would make sense since browsers do that already, but I'm just curious.
Here is an example of how this works: https://github.com/ariya/phantomjs/blob/master/examples/rasterize.js
Emmanuel
On Wed, Nov 13, 2013 at 11:10 AM, Tyler Romeo tylerromeo@gmail.com wrote:
I'm guessing you've found some library that automatically "prints" HTML5, which would make sense since browsers do that already, but I'm just curious.
Yes, phantomjs, as mentioned in the original message.
To be more specific, phantomjs is basically WebKit without a GUI, so the output would be roughly equivalent to opening the page in Chrome or Safari and printing to a PDF. Future plans include using bookjs or the like to improve the rendering.
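For the curious, the service-side invocation could be as simple as shelling out to phantomjs with that rasterize.js example script. A rough sketch follows; the script path and paper format are placeholders, not the planned setup:

# Rough sketch only: "print" an already-rendered HTML page to PDF by
# shelling out to phantomjs with the stock rasterize.js example script.
# The script path and paper format here are placeholders.
import subprocess

def html_to_pdf(url, output_pdf, paper_format="A4"):
    """Render the page at `url` to `output_pdf` using headless WebKit."""
    subprocess.check_call([
        "phantomjs",
        "examples/rasterize.js",   # from the phantomjs repository
        url,
        output_pdf,
        paper_format,
    ])

# html_to_pdf("https://en.wikipedia.org/wiki/Book", "Book.pdf")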
On Wed, Nov 13, 2013 at 11:16 AM, Brad Jorsch (Anomie) bjorsch@wikimedia.org wrote:
Yes, phantomjs, as mentioned in the original message.
To be more specific, phantomjs is basically WebKit without a GUI, so the output would be roughly equivalent to opening the page in Chrome or Safari and printing to a PDF. Future plans include using bookjs or the like to improve the rendering.
Aha awesome. Thanks for explaining.
-- Tyler Romeo, Stevens Institute of Technology, Class of 2016, Major in Computer Science
On 11/13/2013 08:10 AM, Tyler Romeo wrote:
I don't have anything against this, but what's the reasoning? You now have to parse the wikitext into HTML5 and then parse the HTML5 into PDF.
We are already parsing all edited pages to HTML5 and will also start storing (rather than just caching) this HTML very soon, so there will not be any extra parsing involved in the longer term. Getting the HTML will basically be a request for a static HTML page.
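For illustration, the renderer's side of that could be as small as the sketch below; the endpoint URL is purely a placeholder, since the actual storage service and its paths are still being worked out:

# Illustrative sketch only: fetch the already-parsed Parsoid HTML for a
# page and hand it to the PDF step. The endpoint URL is a placeholder,
# not the actual storage service path. Assumes the "requests" library.
import requests

HTML_STORE = "https://html-store.example.org/page/{title}"  # placeholder

def fetch_page_html(title):
    """Fetch stored HTML5 for a page; no wikitext parsing happens here."""
    resp = requests.get(HTML_STORE.format(title=title))
    resp.raise_for_status()
    return resp.text

# html = fetch_page_html("PDF")  # then: add licensing info, render to PDF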
Gabriel