Hi folks,
For a long time we've relied on the mwlib libraries by PediaPress to generate PDFs on Wikimedia sites. These have served us well (we generate >200K PDFs/day), but their architecture predates a lot of important developments in MediaWiki, and they actually re-implement the MediaWiki parser (!) in Python. Moving the entire PDF service to a new data center has given us reason to rethink the architecture and come up with a minimum viable alternative that we can support long term.
Most likely, we'll end up taking Parsoid's HTML5 output, transforming it to add required bits like licensing info, prettifying it, and then rendering it to PDF via phantomjs. That said, we're still looking at various rendering options.
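To make the shape of that pipeline a bit more concrete, here is a rough sketch of the "transform, then render" steps. Nothing here is settled: the function names, the footer markup, and the license wording are all made up for illustration, the HTML is assumed to already have been fetched from Parsoid, and the render step only builds a command line, assuming the rasterize.js example script that ships with PhantomJS.

```python
import html

# Hypothetical placeholder text; the real license wording and markup
# have not been decided.
LICENSE_NOTICE = "Text is available under a free license; see the article history for attribution."


def add_license_footer(page_html, notice=LICENSE_NOTICE):
    """Insert a footer with the license notice just before </body>."""
    footer = '<footer class="license">' + html.escape(notice) + "</footer>"
    idx = page_html.lower().rfind("</body>")
    if idx == -1:
        # No closing body tag found: append the footer at the end.
        return page_html + footer
    return page_html[:idx] + footer + page_html[idx:]


def phantomjs_command(html_path, pdf_path, paper="A4"):
    """Build (but do not run) a PhantomJS command line.

    Assumes PhantomJS's bundled rasterize.js example, which takes an
    input address, an output filename, and an optional paper format.
    """
    return ["phantomjs", "rasterize.js", html_path, pdf_path, paper]
```

In practice the transformed HTML would be written to a temporary file and the command handed to something like subprocess.run. Keeping the transform as a plain string-in, string-out function means the renderer can be swapped out later without touching it.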
Thanks to Matt Walker, C. Scott Ananian, Max Semenik, Brad Jorsch and Jeff Green for joining the effort, and thanks to the PediaPress folks for giving background as needed. Ideally we'd like to continue to support printed book generation via PediaPress' web service, while completely replacing the rendering tech stack on the WMF side of things (still using the Collection extension to manage books). We may need to deprecate some output formats - more on that as we go.
We've got the collection-alt-renderer project set up on Labs (thanks Andrew), and we hope to get our ops team a plan soon for how the new setup could work.
If you want to peek, the work channel is #mediawiki-pdfhack on FreeNode.
Live notes here: http://etherpad.wikimedia.org/p/pdfhack
Stuff will be consolidated here: https://www.mediawiki.org/wiki/PDF_rendering
Some early experiments with different rendering strategies here: https://github.com/cscott/pdf-research
Some improvements to Collection extension underway: https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/...
More soon, Erik