Hey All,
For those who are not aware, the WMF is currently attempting to replace the backend renderer for the Collection extension (mwlib). This is the renderer that creates the PDFs for the 'Download to PDF' sidebar link and creates books (downloadable in multiple formats and printable via PediaPress) using Special:Book. We're taking the data centre migration as our cue to replace mwlib for several reasons, high among them being the desire to use Parsoid to do the parsing from wikitext into something usable by an external tool -- mwlib currently does this conversion internally. This should also allow us to solve several other long-standing mwlib issues with respect to the rendering of non-Latin languages.
Last week we started work on the new renderer, which we're calling the 'Collection Offline Content Generator' or OCG-C. Where we are today is promising but by no means complete: as yet we only have basic support for rendering articles, and a lot of complex articles fail to render. For the curious, we have an alpha product [1] and a public coordination / documentation page [2] -- you can also join us in #mediawiki-pdfhack.
In broad strokes [3]: our solution is an LVS-fronted Node.js backend cluster with a Redis job queue. Bundling (content gathering from the wiki) and rendering are two distinct processes with an intermediate file [4] in between; any renderer should be able to pick the intermediate file up and produce output [5]. We will store bundle files and generated documents with a short timeout in Swift, and keep the final documents for a somewhat longer period in the Varnish frontend cache. Deployments will happen via Trebuchet, and node dependencies are stored in a separate git repository -- much like Parsoid and, eventually, Mathoid [6].
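To make the bundler/renderer handoff concrete, here's a minimal Node.js sketch of how a frontend might enqueue a render job on the Redis queue, using the node_redis client. The queue name, job fields, and writer name below are my own illustrative assumptions, not the actual OCG-C job format:

    // Illustrative sketch only: enqueue a render job on a Redis list.
    // Queue name and job fields are assumptions, not the real format.
    var redis = require('redis');
    var client = redis.createClient(6379, 'localhost');

    var job = {
        id: 'enwiki-some-collection',   // hypothetical job id
        metabook: { items: [ { title: 'Wikipedia' } ] },
        writer: 'rdf2latex'             // hypothetical writer name
    };

    // A worker on the render cluster would BRPOP this, build or fetch
    // the intermediate bundle, render it, and push the result into Swift.
    client.lpush('ocg-render-jobs', JSON.stringify(job), function (err) {
        if (err) { console.error('enqueue failed:', err); }
        client.quit();
    });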
The Foundation is still partnering with PediaPress to provide print-on-demand books. However, bundling and rendering for those will in the future be performed on their servers.
The team will continue to work on this project over the coming weeks. Big mileposts, in no particular order, are table support, puppetization into beta labs, load testing, and multilingual support. Our plan is to have something that the community can reliably beta test soon, with final deployment into production happening, probably, early January [7]. Decommissioning of the old servers is expected to happen by late January, so that's our hard deadline to wrap things up.
Big thanks to Max, Scott, Brad & Jeff for all their help so far, and to Faidon, Ryan and other ops team members for their support.
If you'd like to help, ping me on IRC -- you'll continue to find us in #mediawiki-pdfhack!
~ Matt Walker
[1] http://mwalker-enwikinews.instance-proxy.wmflabs.org/Special:Book
[2] https://www.mediawiki.org/wiki/PDF_rendering
[3] More detail is available at https://www.mediawiki.org/wiki/PDF_rendering/Architecture
[4] The format is almost exactly the same as the format mwlib uses, just with RDF instead of HTML: https://www.mediawiki.org/wiki/PDF_rendering/Bundle_format
[5] Right now the alpha solution only has a LaTeX renderer, but we have plans for a native HTML renderer (both for PDF and epub), and the ZIM community has been in contact with us about their RDF-to-ZIM renderer.
[6] Mathoid is the LaTeX math renderer that Gabriel wrote, which will run on the same servers as this service; both fall under the nebulous category of node-based 'Offline Content Generators'.
[7] I'm being hazy here because we have other duties to other teams again. Tuesdays until Jan 1 are my dedicated days on this project, and I'll be back on it full time come Jan 1. Erik will reach out to organize a follow-up sprint.
Let me talk briefly about the bundle format:
* It is intended to be a complete copy of all wiki resources required to make an offline dump, in any format. That means that all the articles are spidered and template-expanded, and all related images and other media are fetched and stored in a zip archive. The archive will also contain all the license and authorship information needed to make the attributions, etc., required for a license-compliant rendering. This should give developers of rendering backends a substantial head start.
* The current bundle format is backwards compatible with the PediaPress bundles. We have made some additions, primarily to better disambiguate table keys/filenames/etc. for collections which span multiple wikis. We also add the Parsoid parser output.
* The backwards-compatibility features are somewhat experimental. As Matthew noted, the plan is for PediaPress to eventually host their bundler on their own servers. We hope that they will be able to share our bundles, but that decision is up to them. We may deprecate some of the backwards-compatibility content of the bundles (for example, removing the PHP parser output) if no one ends up using it. (Nonetheless, having PediaPress's working bundle format was very helpful to me in writing the new bundler, and I want to thank them!)
* I've made a conscious effort to support *very large* bundles in this format. That is, I try not to hold complete data relating to a bundle in memory, and we use sqlite databases wherever possible to support article-at-a-time access during rendering (see the sketch below). The MW-hosted servers will probably have reasonably small resource limits, but it is my intention that if you want to create an offline dump of an entire wiki (or a large subset thereof), you should be able to use the existing renderers and bundler to do so. I'd encourage people interested in making large slices to get in touch and start playing with the code, so we can identify any bundle-format-related bottlenecks and eliminate them before the format is too firmly established.

* The bundler (and latex renderer) are independent npm modules, loosely coupled to the Collection extension. Again, this should encourage reuse of the bundler and renderer in other projects. Patches welcome!
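To illustrate what article-at-a-time access might look like from a renderer's point of view, here's a small Node.js sketch using the sqlite3 npm module. The database filename, table, and column names are invented for illustration; the real schema is whatever the bundler writes:

    // Illustrative sketch only: pull one article's parsed HTML out of
    // a bundle's sqlite database without loading the whole bundle into
    // memory. File/table/column names here are assumptions.
    var sqlite3 = require('sqlite3');
    var db = new sqlite3.Database('bundle/articles.db',
                                  sqlite3.OPEN_READONLY);

    db.get('SELECT html FROM articles WHERE title = ?',
           ['Wikipedia'],
           function (err, row) {
        if (err) { throw err; }
        console.log(row ? row.html : 'not found');
        db.close();
    });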
http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCollection%2FOffli...
http://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCollection%2FOffli...

The npm module names are still in flux; they're currently mw-bundler and mw-latexer, but maybe mw-ocg-bundler etc. would be better.

--scott
On Mon, Nov 25, 2013 at 2:24 PM, C. Scott Ananian cananian@wikimedia.org wrote:
Let me talk briefly about the bundle format:
- It is intended to be a complete copy of all wiki resources required to make an offline dump, in any format. That means that all the articles are spidered and template-expanded, and all related images and other media are fetched and stored in a zip archive. The archive will also contain all the license and authorship information needed to make the attributions, etc., required for a license-compliant rendering. This should give developers of rendering backends a substantial head start.
Except for these, I suppose:
https://bugzilla.wikimedia.org/show_bug.cgi?id=28064
https://bugzilla.wikimedia.org/show_bug.cgi?id=27629
Helder
Thanks, Matt, for the detailed update, as well as for your leadership throughout the project, and thanks to everyone who's helped with the effort so far. :-)
As Matt outlined, we're going to keep moving on critical-path issues until January and will do a second sprint then to get things ready for production. Currently we're targeting January 6 to January 17 for the second sprint. Will keep you posted.
All best, Erik
On Mon, Nov 25, 2013 at 12:52 AM, Matthew Walker mwalker@wikimedia.org wrote:
For those who are not aware, the WMF is currently attempting to replace the backend renderer for the Collection extension (mwlib). This is the renderer that creates the PDFs for the 'Download to PDF' sidebar link and creates books (downloadable in multiple formats and printable via PediaPress) using Special:Book.
Would you like to see Selenium tests for this feature? We have a few Google Code-in students who are resolving the tasks we give them faster than we can create new ones. If you would like to see tests for this, please let me know as soon as possible. Now is already too late! :)
Even better, create a bug (and send me the link) with examples of what needs to be tested.
Something like this:
- when I create a pdf from an empty page, an empty pdf file should be created
- when I create a pdf from a page that has a title and one paragraph, the pdf should contain the title and one paragraph of text
- and so on...
I do not know what really needs to be tested; these were just a few simple ideas.
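To make the first idea concrete, here is a minimal Node.js sketch of such a check using the selenium-webdriver npm module. The article URL and link texts are assumptions for illustration; a real test would live in the existing browser-test suite:

    // Illustrative sketch only: follow the sidebar PDF link and check
    // that a render page comes back. URL/link text are assumptions.
    var webdriver = require('selenium-webdriver');

    var driver = new webdriver.Builder()
        .withCapabilities(webdriver.Capabilities.firefox())
        .build();

    // Load an article and follow the sidebar link.
    driver.get('https://en.wikipedia.org/wiki/Wikipedia');
    driver.findElement(webdriver.By.linkText('Download as PDF')).click();

    // A very weak assertion: the render/progress page loads at all.
    driver.getTitle().then(function (title) {
        console.log('landed on:', title);
    });

    driver.quit();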
Željko
On Wed, Nov 27, 2013 at 8:12 PM, Željko Filipin zfilipin@wikimedia.org wrote:
Even better, create a bug (and send me the link) with examples of what needs to be tested.
Looks like the bug [1] already exists. I will create a task for Code-in students to write a few simple tests. If more tests are needed, let me know.
Željko
--
1: https://bugzilla.wikimedia.org/show_bug.cgi?id=46224