Hey All,
For those who are not aware, the WMF is currently working to replace the backend renderer for the Collection extension (mwlib). This is the renderer that creates the PDFs for the 'Download to PDF' sidebar link and creates books (downloadable in multiple formats and printable via PediaPress) using Special:Book. We're taking the data centre migration as our cue to replace mwlib for several reasons, high among them the desire to use Parsoid to convert wikitext into something usable by an external tool -- mwlib currently does this conversion internally. This should also allow us to solve several other long-standing mwlib issues with the rendering of non-Latin languages.
Last week we started work on the new renderer, which we're calling the 'Collection Offline Content Generator' (OCG-C). Today I can say that where we are is promising, but by no means complete: so far we have only basic support for rendering articles, and many complex articles fail to render. For the curious, we have an alpha product [1] and a public coordination / documentation page [2] -- you can also join us in #mediawiki-pdfhack.
In broad strokes [3], our solution is an LVS-fronted Node.js backend cluster with a Redis job queue. Bundling (gathering content from the wiki) and rendering are two distinct processes with an intermediate file [4] between them; any renderer should be able to pick the intermediate file up and produce output [5]. We will store bundle files and generated documents under a short timeout in Swift, and keep the final documents in the Varnish frontend cache somewhat longer. Deployments will happen via Trebuchet, and node dependencies are stored in a separate git repository -- much like Parsoid and, eventually, Mathoid [6].
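To make the bundler/renderer split concrete, here is a minimal sketch in plain Node.js. Everything in it is illustrative: the function names (`bundleArticles`, `renderBundle`, `plainTextRenderer`), the bundle's shape, and the in-memory array standing in for the Redis job queue are all my own inventions for this example, not the actual OCG-C code or bundle format.

```javascript
// In-memory stand-in for the Redis job queue (the real service uses Redis,
// and stores bundles and rendered documents in Swift).
const jobQueue = [];

// Bundling step: gather content for each article into one intermediate
// bundle object. fetchHtml is a caller-supplied function, standing in for
// fetching Parsoid output from the wiki.
function bundleArticles(titles, fetchHtml) {
  return {
    version: 1,
    articles: titles.map(title => ({
      title,
      html: fetchHtml(title),
    })),
  };
}

// Rendering step: any renderer that understands the bundle can produce
// output from it -- the two steps only communicate via the bundle.
function renderBundle(bundle, renderer) {
  return renderer(bundle);
}

// A toy renderer that concatenates the articles as plain text; a real
// backend would emit LaTeX/PDF, HTML, epub, ZIM, etc.
const plainTextRenderer = bundle =>
  bundle.articles.map(a => `== ${a.title} ==\n${a.html}`).join('\n\n');

// Usage: a bundler enqueues a job; a render worker later picks it up.
jobQueue.push(bundleArticles(['Example'], t => `<p>Body of ${t}</p>`));
const output = renderBundle(jobQueue.shift(), plainTextRenderer);
console.log(output);
```

The point of the split is the middle object: because the bundle is a self-contained snapshot, rendering can happen on a different machine (or at PediaPress) long after bundling, and new output formats only need a new renderer, not a new crawler.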
The Foundation is still partnering with PediaPress to provide print-on-demand books; however, bundling and rendering for those orders will in future be performed on their servers.
The team will continue to work on this project over the coming weeks. The big mileposts, in no particular order, are table support, puppetization into beta labs, load testing, and multilingual support. Our plan is to have something the community can reliably beta test soon, with final deployment into production happening, probably, in early January [7]. Decommissioning of the old servers is expected to happen by late January, so that's our hard deadline to wrap things up.
Big thanks to Max, Scott, Brad & Jeff for all their help so far, and to Faidon, Ryan and other ops team members for their support.
If you'd like to help, ping me on IRC -- you'll continue to find us in #mediawiki-pdfhack!
~ Matt Walker
[1] http://mwalker-enwikinews.instance-proxy.wmflabs.org/Special:Book
[2] https://www.mediawiki.org/wiki/PDF_rendering
[3] More detail is available at https://www.mediawiki.org/wiki/PDF_rendering/Architecture
[4] The format is almost exactly the same as the format mwlib uses, just with RDF instead of HTML: https://www.mediawiki.org/wiki/PDF_rendering/Bundle_format
[5] Right now the alpha solution only has a LaTeX renderer, but we have plans for a native HTML renderer (for both PDF and epub), and the ZIM community has been in contact with us about their RDF-to-ZIM renderer.
[6] Mathoid is the LaTeX math renderer that Gabriel wrote, which will run on the same servers as this service; both fall under the nebulous category of node-based 'Offline Content Generators'.
[7] I'm being hazy here because we still have duties to other teams. Tuesdays until Jan 1 are my dedicated days for this project, and I become full time on it again come Jan 1. Erik will reach out to organize a follow-up sprint.