I have identified an organization which is willing to spend up to about EUR 10,000 on adding support for exporting MediaWiki pages as PDF files, and improving document management for documents consisting of multiple pages.
My current thinking is that the functionality implemented, as a minimum, would be as follows:
a) Using an extension, integrate a "PDF link" on any wiki page which would call an external library like HTMLDOC on that single wiki page.
b) Support filters on the rendered HTML (replacing image thumbnails with high-resolution images, filtering content by regular expression, etc.), and revision filters (export the last revision edited by a user on whitelist Y, or the revision closest to the current date minus Z).
c) Create a "PDF basket" UI which makes it possible to compile a PDF from multiple pages easily (and rearrange the pages in a hierarchy). The resulting structures could potentially also be stored as wikitext, using a new <structure> extension tag, so that they can be used both by individuals compiling PDFs for personal use and by groups collaborating on complex documents.
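As a rough illustration only (not a committed design), the single-page export in (a) plus one rendered-HTML filter from (b) could look something like the sketch below. It assumes HTMLDOC is installed and fetches MediaWiki's usual action=render output; the wiki URL, the thumbnail path regex, and the function names are placeholders.

import re
import subprocess
import tempfile
import urllib.parse
import urllib.request

WIKI_INDEX = "http://example.org/w/index.php"   # placeholder wiki URL

def fetch_rendered_html(title):
    # MediaWiki's action=render returns just the parsed article body as HTML
    url = "%s?title=%s&action=render" % (WIKI_INDEX, urllib.parse.quote(title))
    return urllib.request.urlopen(url).read().decode("utf-8")

def upscale_thumbnails(html):
    # One possible filter on the rendered HTML: point <img> tags at the
    # original upload instead of the thumbnail.  The path layout assumed here
    # (/images/thumb/a/ab/File.jpg/220px-File.jpg) is a guess about the
    # target wiki's upload directory, not a guarantee.
    return re.sub(r'/images/thumb/(\w/\w\w/[^/"]+?)/\d+px-[^"]*', r'/images/\1', html)

def export_pdf(title, pdf_path):
    body = upscale_thumbnails(fetch_rendered_html(title))
    page = "<html><head><title>%s</title></head><body>%s</body></html>" % (title, body)
    with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
        tmp.write(page.encode("utf-8"))
        html_file = tmp.name
    # htmldoc: --webpage treats the input as a continuous web page,
    # -t pdf selects the output format, -f names the output file
    subprocess.run(["htmldoc", "--quiet", "--webpage", "-t", "pdf",
                    "-f", pdf_path, html_file], check=True)

export_pdf("Main Page", "Main_Page.pdf")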
Some of the budget could also go toward improving the external PDF library used, especially if we can raise additional funds for this project.
I'd like to request comments on this approach, specifically:
- Besides HTMLDOC, do you know a good (X)HTML-to-PDF library which could be used for this purpose?
- Within this budget, do you believe an alternative approach which utilizes an intermediate format is viable (e.g. wiki-to-DocBook-to-PDF), given the complexity of the MediaWiki syntax, its various extensions, and the need to keep up with parser changes?
- If you are a developer, would you be interested in working on this project, and available to do so? (If so, please contact me privately.)
Any other comments would also be appreciated.

Peace & Love, Erik
Erik Moeller wrote:
for exporting MediaWiki pages as PDF files, and improving document management for documents consisting of multiple pages. [...] c) Create a "PDF basket" UI which makes it possible to compile a PDF from multiple pages easily (and rearrange the pages in a hierarchy).
Things to throw into this basket are Wikisource and Wikibooks. These projects currently consist of thousands of "pages", but there is no way to separate a "book" (or title or volume) from the rest of the site. For example, at download.wikimedia.org I can download the entire French Wikisource in XML, but I cannot download the XML (or PDF) for just one of the books there.
If I'm lucky, all wiki pages (facsimile pages or chapters) that belong to one book are subpages of a single page, or all have page names that begin with a common prefix, but the current MediaWiki software doesn't help in maintaining this integrity. Clicking around to manually fill a PDF basket would be a nightmare.
Of course, for a Wikisource book with scanned images, I'd like those images printed in the PDF, while the OCR text would be used for searching through the PDF.
As far as I know, the ProofreadPage extension (the Page: namespace) doesn't support a grouping or table of contents of all pages belonging to a book. This would be a useful next step.
The resulting structures could potentially also be stored as wikitext, using a new <structure> extension tag, so that they can be used both by individuals compiling PDFs for personal use, and by groups collaborating on complex documents.
This could be it.
Hi Erik,
I've previously given something along these lines some thought, and rapidly came to the conclusion that it's a huge amount of work, yet potentially exceedingly useful (assuming it scales up to something that can handle book-sized documents).
Essentially, something that exports a list of articles you give it into PDF format; you can then print the PDF file yourself, have someone else professionally print it into a nicely-bound book, or just read the PDF electronically.
So why would anyone do this? To see why, imagine a world in which:
* All teachers have complete control over their textbooks, instead of having them handed down from on high.
* The content in those books is built by clicking together prefabricated blocks of text (namely articles, or sections of articles), like bits of Lego.
* The content that makes up those articles is (hopefully) free and public (such as the Wikipedia), so that any corrections, updates, or expansions are available to all (including other teachers).
* Impoverished high-school or uni students no longer have to pay $50 for a textbook that they're only going to read bits of (and probably only then, right before their exam) - instead they can get the PDF and read it electronically, or just print out the bits they want, or, if they really want the book, they can presumably get a better price than $50 for having it printed, because there can be competition for printing/binding in a way that there isn't currently for textbooks.
In practical terms, a use-case something along these lines was what I had in mind:
* User logs into MediaWiki, and goes to a new extension page, possibly called [[Special:MakeMyBook]], which allows tagging multiple articles for PDF export.
* The user can create a new document, which allows ordering / reordering the articles as desired.
* Ideally, it would also allow including excerpts from articles (e.g. Labelled Section Transclusion).
* Ideally, it would also allow using particular versions of articles, or maybe even better, the latest stable version of the article.
* Ideally, it would also allow including content from other remote wikis too, not just the local wiki, and it would respect the copyright terms of those wikis. For example, suppose that Wikitravel or Wikia or Wikisource had some great material that complemented some GFDL material in the Wikipedia. It would be really powerful to be able to have those bits of content side-by-side (assuming the licenses were compatible), and have it automatically include all the necessary license-term legalese in an appendix in 5-point font. Yuri's API already includes a mechanism for getting the content license from a wiki (assuming the remote wiki is running a recent SVN revision).
* Once happy with the structure, the user would then click a "make my book" or "Engage!" button.
* Some server-side job would then go off and do the processing, and email the user when it's done and/or leave them a message on their talk page.
* The user could then download the resulting PDF file from an FTP or web site, and presumably after a while the file would be deleted to free up disk space.
* The document structure / index should be saved and public, so that it can be viewed by others (who are interested in the same topic), or worked on by others (when collaboratively making a document), or modified by the original author (if they want to update or expand the document at a later date).
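To make the data side of that use-case concrete, here is a minimal sketch (all names are hypothetical, not a proposed API) of the kind of record such a [[Special:MakeMyBook]] feature would need to store per book, covering ordering, pinned revisions, excerpts, and remote content:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BookItem:
    # One entry in the book: a wiki page, optionally pinned to a revision,
    # restricted to a section, or pulled from a remote wiki.
    title: str
    revision: Optional[int] = None       # None = latest (or latest stable) version
    section: Optional[str] = None        # e.g. a labelled section to include
    source_wiki: Optional[str] = None    # None = local wiki

@dataclass
class Book:
    name: str
    owner: str
    items: List[BookItem] = field(default_factory=list)

    def add(self, title, **kwargs):
        self.items.append(BookItem(title, **kwargs))

    def move(self, old_index, new_index):
        # Reordering entries is the core "basket" operation
        self.items.insert(new_index, self.items.pop(old_index))

# Example: a small book mixing pinned, partial, and remote content
book = Book(name="Example reader", owner="ExampleUser")
book.add("Photosynthesis")
book.add("Chlorophyll", revision=1726, section="Structure")
book.add("Plant biology", source_wiki="http://en.wikibooks.org")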
The "PDF link" in the sidebar for exporting just the current article is a great idea that I hadn't even considered, but building books I think is probably the most strategically important application. Along the same vein though, you could maybe have an "Add to book" link, which tacks the currently active article onto the end of the currently active PDF document / book (so that people could browse around pages related to a topic, and tag the stuff they found relevant).
The three main issues that come to mind though are:
* CPU power + disk space requirements: If it gets used in the way I would hope it would be used, it's going to get used a lot, and creating/optimizing large PDF files is not a computationally cheap operation. Multiply that by many users, each making multiple revisions of each book, and you have the potential for a huge backlog of tasks, each producing some very large files. So even if it were super-efficient, I expect it'd need some serious CPU power, plus have large disk space requirements.
* How to do the actual conversion to PDF (discussed below).
* Coming up with a decent user interface for allowing the user to add / move / delete / rearrange content (or maybe just start out with an unordered list in text format, to keep it simple), and working out where to store that information (e.g. should it get its own namespace, rather than cluttering up the current namespaces?).
... And then of course, there's the small issue of actually building the damn thing, once it's determined what's being built, and that it can theoretically be built ;-)
- Within this budget, do you believe an alternative approach which
utilizes an intermediate format is viable (e.g. wiki-to-Docbook-to-PDF), given the complexity of the MediaWiki syntax, its various extensions, and the need to keep up with parser changes?
Well, I was wondering if it was possible to "cheat", and avoid the whole compatibility problem, by doing something like this:
+--------------------+
| Private web server |
+--------------------+
          |
          v
+--------------------+
|     MediaWiki      |
+--------------------+
          |
          v
+--------------------+
|  Embedded Firefox  |
+--------------------+
          |
          |  Automatically print
          v
+--------------------+
|     PDF export     |
+--------------------+
I.e. a quasi-embedded version of what already happens, just all on one box instead of over the network, and behaving like an integrated pipeline instead of as separate independent bits. The benefit of this is that you can use already-existing, known-working software, and avoid reinventing the wheel. And you don't have to worry about keeping compatibility with the parser - just leave it as MediaWiki's problem to convert wikitext to XHTML. The downsides of this are:
* It may not even be possible to do this in a sensible way (i.e. to take these currently completely separate bits of software and embed them into one large process, which takes wikitext as input and spits out PDF at the end), although my current guess is that it probably is (with a lot of hacking), but that's just a guess.
* You probably wouldn't get the "Support filters on the rendered HTML" functionality (or at the very least it might be harder), so thumbnails on the current print output would look like thumbnails on the PDF output.
* You wouldn't get internal PDF hyperlinks (e.g. clicking on something in the index to take you to somewhere in the body probably wouldn't work).
* It would probably produce lots of separate intermediate PDF files (one per article), so there would need to be a step for combining many separate PDFs into one large one, whilst not leaving blank gaps between articles. Alternatively, you would have to create one huge document which includes all of the required articles and print that in one go, which would avoid this problem, but might be slow (just as rendering a 200-page document through MediaWiki at the moment would be slow) and use lots of RAM.
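For what it's worth, that last combining step probably doesn't need new code; here's a sketch using Ghostscript's pdfwrite device (assuming gs is installed; the file names are placeholders, and the inter-article page-break question is ignored):

import subprocess

def merge_pdfs(article_pdfs, output_path):
    # Ghostscript rewrites all the input PDFs into a single output file;
    # -dBATCH / -dNOPAUSE make it run non-interactively.
    subprocess.run(["gs", "-q", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
                    "-sOutputFile=" + output_path] + list(article_pdfs),
                   check=True)

merge_pdfs(["Article_1.pdf", "Article_2.pdf", "Article_3.pdf"], "book.pdf")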
BTW, the best "see also" web links for this would probably be http://meta.wikimedia.org/wiki/Paper_Wikipedia and http://pediapress.com/ ; The pediapress example in particular is interesting as they have a working commercial implementation of some of this stuff, but it has a number of drawbacks, such as a) only seems to be for English Wikipedia b) was using a dump from July 2006 last time I looked, so very out of date, plus errors frozen in time and cannot be removed or corrected c) the preview PDF they give you has "SAMPLE" stamped across every page in red letters d) they're professional printers so they really want to sell printed material, not create PDF files e) can't reorder articles, everything is in alphabetical order only f) can't include partial content, such as sections g) can't pull content from non-local wikis h) can't see a way to collaborate on books or share with others the stuff you have worked on but not yet finished. That said, their site is still kinda neat, and certainly worth looking at to see what works well and what doesn't work so well.
All the best, Nick.
-----Original Message----- From: mediawiki-l-bounces@Wikimedia.org [mailto:mediawiki-l-bounces@Wikimedia.org] On Behalf Of Erik Moeller Sent: 05 January 2007 04:02 To: Wikimedia developers; MediaWiki announcements and site admin list Subject: [Mediawiki-l] RfC: PDF and document management for MediaWiki
I'd like to request comments on this approach, specifically:
- Besides HTMLDOC, do you know a good (X)HTML-to-PDF library which
could be used for this purpose?
(X)HTML can be transformed with an XSL stylesheet to XSL-FO, and then Apache FOP can be used for PDF generation.
- Within this budget, do you believe an alternative approach which
utilizes an intermediate format is viable (e.g. wiki-to-Docbook-to-PDF), given the complexity of the MediaWiki syntax, its various extensions, and the need to keep up with parser changes?
The standard DocBook tools perform the same process as above, but start with DocBook, which is transformed with XSL to XSL-FO.
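For reference, a rough sketch of that two-stage pipeline, assuming xsltproc and Apache FOP are installed and using the standard DocBook XSL-FO stylesheets; the file names and stylesheet path are placeholders, not existing MediaWiki tooling:

import subprocess

def to_pdf_via_fo(source_xml, fo_stylesheet, output_pdf):
    # Stage 1: XSLT transform (DocBook or XHTML source) into XSL-FO
    subprocess.run(["xsltproc", "-o", "document.fo", fo_stylesheet, source_xml],
                   check=True)
    # Stage 2: Apache FOP renders the XSL-FO into a PDF
    subprocess.run(["fop", "-fo", "document.fo", "-pdf", output_pdf], check=True)

to_pdf_via_fo("article.xml", "docbook-xsl/fo/docbook.xsl", "article.pdf")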
Jared
On Fri, Jan 05, 2007 at 05:01:45AM +0100, Erik Moeller wrote:
- Within this budget, do you believe an alternative approach which
utilizes an intermediate format is viable (e.g. wiki-to-Docbook-to-PDF), given the complexity of the MediaWiki syntax, its various extensions, and the need to keep up with parser changes?
Some work has already been done on this front; check the archives. It's one of the reasons I encouraged MythTV to move from Moin to MW.
Cheers, -- jra
On 1/5/07, Jay R. Ashworth jra@baylink.com wrote:
On Fri, Jan 05, 2007 at 05:01:45AM +0100, Erik Moeller wrote:
- Within this budget, do you believe an alternative approach which
utilizes an intermediate format is viable (e.g. wiki-to-Docbook-to-PDF), given the complexity of the MediaWiki syntax, its various extensions, and the need to keep up with parser changes?
Some work has already been done on this front; check the archives. It's one of the reasons I encouraged MythTV to move from Moin to MW.
I also have done some work on my wiki-to-xml-to-stuff converter. It's not up-to-date regarding the last changes to the parser (if-templates etc) but does the basics pretty well.
http://127.0.0.1/wiki2xml/php/w2x.php
It probably has many small bugs, but the last time I used it, it ran pretty solid. It comes with a MediaWiki extension framework, so it can run either as a separate installation or as a MW extension. If you install it locally, you can set it up for automatic PDF creation via DocBook.
Magnus
On 1/5/07, Magnus Manske magnusmanske@googlemail.com wrote:
I also have done some work on my wiki-to-xml-to-stuff converter. It's not up-to-date regarding the last changes to the parser (if-templates etc) but does the basics pretty well.
clicking on that link only gives me lots of pr0n and... oh. nevermind.
Mathias
On 1/5/07, Mathias Schindler mathias.schindler@gmail.com wrote:
On 1/5/07, Magnus Manske magnusmanske@googlemail.com wrote:
I also have done some work on my wiki-to-xml-to-stuff converter. It's not up-to-date regarding the last changes to the parser (if-templates etc) but does the basics pretty well.
clicking on that link only gives me lots of pr0n and... oh. nevermind.
D'oh! Shoulda hide the pr0n better next time ;-)
http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php
Magnus
"Erik Moeller" erik@wikimedia.org wrote in message news:b80736c80701042001t58c3a7f9naf68c1e70045885@mail.gmail.com...
c) Create a "PDF basket" UI which makes it possible to compile a PDF from multiple pages easily (and rearrange the pages in a hierarchy). The resulting structures could potentially also be stored as wikitext, using a new <structure> extension tag, so that they can be used both by individuals compiling PDFs for personal use, and by groups collaborating on complex documents.
Why not just use the current page transclusion method?
If I want to create "The Human Body" I could create a page in my userspace: "User:HappyDog/Body" and transclude the appropriate articles:
{{:head}}
{{:torso}}
{{:arm}}
{{:leg}}
etc.
I pick the articles, and pick the order. I can insert my own headings and a specialised introduction, create a TOC manually (suppressing the normal one using the __NOTOC__ magic word), and do all the other stuff that is so easy to do in wikitext. Then I can just 'Export PDF' to get my book.
Collections of information that may be useful to many people can be created in the same way, but in a 'public' namespace (which will vary according to the wiki in question).
I would suggest an addition to the transclusion syntax that allows the specification of a particular revision, e.g. {{:body|1726}} to give revision 1726 (and yes, I am aware that pipe syntax will not work - I don't know the parser well enough to suggest an alternative), but aside from that I think the mechanisms we already have offer a much simpler and more flexible approach than a dedicated 'basket'.
- Mark Clements (HappyDog)