Hey All,
For those who are not aware, the WMF is currently attempting to replace the backend renderer for the Collection extension (mwlib). This is the renderer that creates the PDFs for the 'Download to PDF' sidebar link and creates books (downloadable in multiple formats and printable via PediaPress) using Special:Book. We're taking the data centre migration as our cue to replace mwlib for several reasons, high among them being the desire to use Parsoid to do the parsing from wikitext into something usable by an external tool -- mwlib currently does this conversion internally. This should also allow us to solve several other long-standing mwlib issues with respect to the rendering of non-Latin languages.
Last week we started work on the new renderer, which we're calling the 'Collection Offline Content Generator' or OCG-C. Where we are today is promising but by no means complete: as yet we only have basic support for rendering articles, and a lot of complex articles fail to render. For the curious, we have an alpha product [1] and a public coordination / documentation page [2] -- you can also join us in #mediawiki-pdfhack.
In broad strokes [3]: our solution is an LVS-fronted Node.js backend cluster with a Redis job queue. Bundling (content gathering from the wiki) and rendering are two distinct processes with an intermediate file [4] in between; any renderer should be able to pick the intermediate file up and produce output [5]. We will store bundle files and generated documents with a short timeout in Swift, and keep the final documents for a somewhat longer period in the Varnish frontend cache. Deployments will happen via Trebuchet, and node dependencies are stored in a separate git repository -- much like Parsoid and, eventually, Mathoid [6].
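To make the bundler/renderer handoff concrete, here's a minimal Node.js sketch of how a frontend might enqueue a render job on the Redis queue, using the node_redis client. The queue name, job fields, and writer name below are my own illustrative assumptions, not the actual OCG-C job format:

    // Illustrative sketch only: enqueue a render job on a Redis list.
    // Queue name and job fields are assumptions, not the real format.
    var redis = require('redis');
    var client = redis.createClient(6379, 'localhost');

    var job = {
        id: 'enwiki-some-collection',   // hypothetical job id
        metabook: { items: [ { title: 'Wikipedia' } ] },
        writer: 'rdf2latex'             // hypothetical writer name
    };

    // A worker on the render cluster would BRPOP this, build or fetch
    // the intermediate bundle, render it, and push the result into Swift.
    client.lpush('ocg-render-jobs', JSON.stringify(job), function (err) {
        if (err) { console.error('enqueue failed:', err); }
        client.quit();
    });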
The Foundation is still partnering with PediaPress to provide print-on-demand books. However, bundling and rendering for those will in the future be performed on their servers.
The team will continue to work on this project over the coming weeks. Big mileposts, in no particular order, are table support, puppetization into beta labs, load testing, and multilingual support. Our plan is to have something that the community can reliably beta test soon, with final deployment into production happening, probably, early January [7]. Decommissioning of the old servers is expected to happen by late January, so that's our hard deadline to wrap things up.
Big thanks to Max, Scott, Brad & Jeff for all their help so far, and to Faidon, Ryan and other ops team members for their support.
If you'd like to help, ping me on IRC -- you'll continue to find us in #mediawiki-pdfhack!
~ Matt Walker
[1] http://mwalker-enwikinews.instance-proxy.wmflabs.org/Special:Book
[2] https://www.mediawiki.org/wiki/PDF_rendering
[3] More detail is available at https://www.mediawiki.org/wiki/PDF_rendering/Architecture
[4] The format is almost exactly the same as the format mwlib uses, just with RDF instead of HTML: https://www.mediawiki.org/wiki/PDF_rendering/Bundle_format
[5] Right now the alpha solution only has a LaTeX renderer, but we have plans for a native HTML renderer (both for PDF and epub), and the ZIM community has been in contact with us about their RDF-to-ZIM renderer.
[6] Mathoid is the LaTeX math renderer that Gabriel wrote, which will run on the same servers as this service; both fall under the nebulous category of node-based 'Offline Content Generators'.
[7] I'm being hazy here because we have other duties to other teams again. Tuesdays until Jan 1 are my dedicated days on this project, and I'll be back on it full time come Jan 1. Erik will reach out to organize a follow-up sprint.
Let me talk briefly about the bundle format:
* It is intended to be a complete copy of all wiki resources required to make an offline dump, in any format. That means that all the articles are spidered and template-expanded, and all related images and other media are fetched and stored in a zip archive. The archive will also contain all the license and authorship information needed to make the attributions, etc., required for a license-compliant rendering. This should give developers of rendering backends a substantial head start.
* The current bundle format is backwards compatible with the PediaPress bundles. We have made some additions, primarily to better disambiguate table keys/filenames/etc. for collections which span multiple wikis. We also add the Parsoid parser output.
* The backwards-compatibility features are somewhat experimental. As Matthew noted, the plan is for PediaPress to eventually host their bundler on their own servers. We hope that they will be able to share our bundles, but that decision is up to them. We may deprecate some of the backwards-compatibility content of the bundles (for example, removing the PHP parser output) if no one ends up using it. (Nonetheless, having PediaPress's working bundle format was very helpful to me in writing the new bundler, and I want to thank them!)
* I've made a conscious effort to support *very large* bundles in this format. That is, I try not to hold complete data relating to a bundle in memory, and we use sqlite databases wherever possible to support article-at-a-time access during rendering (see the sketch below). The MW-hosted servers will probably have reasonably small resource limits, but it is my intention that if you want to create an offline dump of an entire wiki (or a large subset thereof), you should be able to use the existing renderers and bundler to do so. I'd encourage people interested in making large slices to get in touch and start playing with the code, so we can identify any bundle-format-related bottlenecks and eliminate them before the format is too firmly established.

* The bundler (and latex renderer) are independent npm modules, loosely coupled to the Collection extension. Again, this should encourage reuse of the bundler and renderer in other projects. Patches welcome!
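To illustrate what article-at-a-time access might look like from a renderer's point of view, here's a small Node.js sketch using the sqlite3 npm module. The database filename, table, and column names are invented for illustration; the real schema is whatever the bundler writes:

    // Illustrative sketch only: pull one article's parsed HTML out of
    // a bundle's sqlite database without loading the whole bundle into
    // memory. File/table/column names here are assumptions.
    var sqlite3 = require('sqlite3');
    var db = new sqlite3.Database('bundle/articles.db',
                                  sqlite3.OPEN_READONLY);

    db.get('SELECT html FROM articles WHERE title = ?',
           ['Wikipedia'],
           function (err, row) {
        if (err) { throw err; }
        console.log(row ? row.html : 'not found');
        db.close();
    });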
http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCollection%2FOffli...
http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCollection%2FOffli...

The npm module names are still in flux; they're currently mw-bundler and mw-latexer, but maybe mw-ocg-bundler etc. would be better.

--scott
On Mon, Nov 25, 2013 at 2:24 PM, C. Scott Ananian cananian@wikimedia.org wrote:
Let me talk briefly about the bundle format:
- It is intended to be a complete copy of all wiki resources required to make an offline dump, in any format. That means that all the articles are spidered and template-expanded, and all related images and other media are fetched and stored in a zip archive. The archive will also contain all the license and authorship information needed to make the attributions, etc., required for a license-compliant rendering. This should give developers of rendering backends a substantial head start.
Except for these, I suppose:
https://bugzilla.wikimedia.org/show_bug.cgi?id=28064
https://bugzilla.wikimedia.org/show_bug.cgi?id=27629
Helder
Thanks, Matt, for the detailed update, as well as for your leadership throughout the project, and thanks to everyone who's helped with the effort so far. :-)
As Matt outlined, we're going to keep moving on critical-path issues until January and will do a second sprint then to get things ready for production. Currently we're targeting January 6 to January 17 for the second sprint. Will keep you posted.
All best, Erik
On Mon, Nov 25, 2013 at 12:52 AM, Matthew Walker mwalker@wikimedia.org wrote:
For those who are not aware, the WMF is currently attempting to replace the backend renderer for the Collection extension (mwlib). This is the renderer that creates the PDFs for the 'Download to PDF' sidebar link and creates books (downloadable in multiple formats and printable via PediaPress) using Special:Book.
Would you like to see Selenium tests for this feature? We have a few Google Code-in students who are resolving the tasks we give them faster than we can create new ones. If you would like to see tests for this, please let me know as soon as possible. Now is already too late! :)
Even better, create a bug (and send me the link) with examples of what needs to be tested.
Something like this:
- when I create a pdf from an empty page, an empty pdf file should be created
- when I create a pdf from a page that has a title and one paragraph, the pdf should contain the title and one paragraph of text
- and so on...
I do not know what really needs to be tested; these were just a few simple ideas.
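To make the first idea concrete, here is a minimal Node.js sketch of such a check using the selenium-webdriver npm module. The article URL and link texts are assumptions for illustration; a real test would live in the existing browser-test suite:

    // Illustrative sketch only: follow the sidebar PDF link and check
    // that a render page comes back. URL/link text are assumptions.
    var webdriver = require('selenium-webdriver');

    var driver = new webdriver.Builder()
        .withCapabilities(webdriver.Capabilities.firefox())
        .build();

    // Load an article and follow the sidebar link.
    driver.get('https://en.wikipedia.org/wiki/Wikipedia');
    driver.findElement(webdriver.By.linkText('Download as PDF')).click();

    // A very weak assertion: the render/progress page loads at all.
    driver.getTitle().then(function (title) {
        console.log('landed on:', title);
    });

    driver.quit();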
Željko
On Wed, Nov 27, 2013 at 8:12 PM, Željko Filipin zfilipin@wikimedia.org wrote:
Even better, create a bug (and send me the link) with examples of what needs to be tested.
Looks like the bug [1] already exists. I will create a task for Code-in students to write a few simple tests. If more tests are needed, let me know.
Željko
--
1: https://bugzilla.wikimedia.org/show_bug.cgi?id=46224