Hi Erik,
I've previously given something along these lines some thought, and rapidly came to the conclusion that it's a huge amount of work, yet potentially exceedingly useful (assuming it scales up to something that can handle book-sized documents).
Essentially, something that exports a list of articles you give it into PDF format; you can then print the PDF yourself, have someone else professionally print and bind it into a nice book, or just read the PDF electronically.
So why would anyone do this? To see why, imagine a world in which:
* All teachers have complete control over their textbooks, instead of having them handed down from on high.
* The content in those books is built by clicking together prefabricated blocks of text (namely articles, or sections of articles), like bits of Lego.
* The content that makes up those articles is (hopefully) free and public (such as the Wikipedia), so that any corrections or updates or expansions are available to all (including other teachers).
* Impoverished high-school or uni students no longer have to pay $50 for a textbook that they're only going to read bits of (and probably only then right before their exam). Instead they can get the PDF and read it electronically, or print out just the bits they want, or, if they really want the book, presumably get it printed for less than $50, because there can be competition for printing/binding in a way that there isn't currently for textbooks.
In practical terms, a use-case something along these lines was what I had in mind:
* User logs into MediaWiki, and goes to a new extension page, possibly called [[Special:MakeMyBook]], which allows tagging multiple articles for PDF export.
* The user can create a new document, which allows ordering / reordering the articles as desired.
* Ideally, it would also allow including excerpts from articles (e.g. Labelled Section Transclusion).
* Ideally, it would also allow using particular versions of articles, or maybe even better, the latest stable version of each article.
* Ideally, it would also allow including content from remote wikis too, not just the local wiki, and it would respect the copyright terms of those wikis. For example, suppose that Wikitravel or Wikia or Wikisource had some great material that complemented some GFDL material in the Wikipedia. It would be really powerful to be able to have those bits of content side-by-side (assuming the licenses were compatible), and have it automatically include all the necessary license-term legalese in an appendix in 5-point font. Yuri's API already includes a mechanism for getting the content license from a wiki (assuming the remote wiki is running a recent SVN revision); a sketch of such a license query follows this list.
* Once happy with the structure, the user would then click a "make my book" or "Engage!" button.
* Some server-side job would then go off and do the processing, and email the user when it's done and/or leave them a message on their talk page.
* The user could then download the resulting PDF file from an FTP or web site; presumably after a while the file would be deleted to free up disk space.
* The document structure / index should be saved and public, so that it can be viewed by others who are interested in the same topic, worked on by others when collaboratively making a document, or modified by the original author if they want to update or expand the document at a later date.
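For illustration only, here's a minimal sketch in Python of that license lookup. It goes through the current api.php interface rather than Yuri's original query.php, and the User-Agent string is a made-up placeholder; the siprop=rightsinfo query itself is real:

    import json
    import urllib.request

    def fetch_license(api_url):
        """Return (license name, license URL) for a MediaWiki installation,
        via action=query&meta=siteinfo&siprop=rightsinfo."""
        query = "?action=query&meta=siteinfo&siprop=rightsinfo&format=json"
        request = urllib.request.Request(
            api_url + query,
            headers={"User-Agent": "MakeMyBook/0.1 (example)"},  # be polite
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        rights = data["query"]["rightsinfo"]
        return rights.get("text", ""), rights.get("url", "")

    name, url = fetch_license("https://en.wikipedia.org/w/api.php")
    print("License:", name, "-", url)

A book-builder could run this against each remote wiki before transclusion, and refuse (or flag) content whose license is incompatible with the rest of the book.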
The "PDF link" in the sidebar for exporting just the current article is a great idea that I hadn't even considered, but building books I think is probably the most strategically important application. Along the same vein though, you could maybe have an "Add to book" link, which tacks the currently active article onto the end of the currently active PDF document / book (so that people could browse around pages related to a topic, and tag the stuff they found relevant).
The three main issues that come to mind though are:
* CPU power + disk space requirements: If it gets used in the way I would hope it would be used, it's going to get used a lot, and creating/optimizing large PDF files is not a computationally cheap operation. Multiply that by many users, each making multiple revisions of each book, and you have the potential for a huge backlog of tasks, each producing some very large files. So even if it was super-efficient, I expect it'd need some serious CPU power, plus have large disk space requirements (a rough sketch of the kind of worker loop this implies follows this list).
* How to do the actual conversion to PDF (discussed below).
* Coming up with a decent user interface for allowing the user to add / move / delete / rearrange content (or maybe just start out with an unordered list in text format, to start simple), and working out where to store that information (e.g. should it get its own namespace, rather than cluttering up the current namespaces?).
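To make the resource concern concrete, here's a rough sketch in Python of the server-side worker loop I have in mind: take a render job off a queue, build the PDF, notify the user, and periodically sweep stale output to reclaim disk. Every name and path below is hypothetical, not part of any existing extension:

    import os
    import queue
    import time

    STALE_AFTER = 7 * 24 * 3600           # delete generated PDFs after a week
    OUTPUT_DIR = "/var/spool/makemybook"  # hypothetical spool directory

    jobs = queue.Queue()  # stand-in for MediaWiki's real job queue

    def render_book(job):
        """Placeholder for the expensive part: wikitext -> one large PDF."""
        path = os.path.join(OUTPUT_DIR, job["book_id"] + ".pdf")
        # ... render each article and write the combined PDF to `path` ...
        return path

    def sweep_stale_files():
        """Reclaim disk space: remove PDFs nobody has fetched in a week."""
        now = time.time()
        for name in os.listdir(OUTPUT_DIR):
            path = os.path.join(OUTPUT_DIR, name)
            if now - os.path.getmtime(path) > STALE_AFTER:
                os.remove(path)

    while True:
        job = jobs.get()  # blocks until a "make my book" request arrives
        pdf_path = render_book(job)
        # ... email the user / leave a talk-page message pointing at pdf_path ...
        sweep_stale_files()

Even with something this simple, the render step dominates: a deep queue of book-sized jobs is where the serious CPU and disk requirements come from.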
... And then of course, there's the small issue of actually building the damn thing, once it's determined what's being built, and that it can theoretically be built ;-)
- Within this budget, do you believe an alternative approach which utilizes an intermediate format is viable (e.g. wiki-to-DocBook-to-PDF), given the complexity of the MediaWiki syntax, its various extensions, and the need to keep up with parser changes?
Well, I was wondering if it was possible to "cheat", and avoid the whole compatibility problem, by doing something like this:
+--------------------+
| Private web server |
+--------------------+
          |
          v
+--------------------+
|     MediaWiki      |
+--------------------+
          |
          v
+--------------------+
|  Embedded Firefox  |
+--------------------+
          |
          | Automatically print
          v
+--------------------+
|     PDF export     |
+--------------------+
I.e. a quasi-embedded version of what already happens, just all on one box instead of over the network, and behaving like an integrated pipeline instead of as separate independent bits. The benefit of this is that you can use already-existing, known-working software, and avoid reinventing the wheel. And you don't have to worry about keeping compatibility with the parser - just leave it as MediaWiki's problem to convert wikitext to XHTML. The downsides of this are:
* It may not even be possible to do this in a sensible way (i.e. to take these currently completely separate bits of software, and embed them into one large process which takes wikitext as input and spits out PDF at the end), although my current guess is that it probably is (with a lot of hacking), but that's just a guess.
* You probably wouldn't get the "Support filters on the rendered HTML" functionality (or at the very least it might make it harder), so thumbnails in the current print output would look like thumbnails in the PDF output.
* Wouldn't get internal PDF hyperlinks (e.g. clicking on something in the index to take you to somewhere in the body probably wouldn't work).
* Would probably produce lots of separate intermediate PDF files (one per article), so there would need to be a step for combining many separate PDFs into one large one, whilst not having blank gaps between articles (a sketch of this render-and-merge step follows below). Alternatively, it could create one huge document which includes all of the required articles and print that in one go, which would avoid this problem, but might be slow (just as rendering a 200-page document through MediaWiki at the moment would be slow) and use lots of RAM.
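To illustrate that render-and-merge step, here's a minimal sketch in Python. It cheats slightly by substituting wkhtmltopdf (an existing command-line tool that embeds WebKit rather than Firefox, but plays the same "browser that prints" role) for the embedded-browser box, and uses the pypdf library to concatenate the per-article files; the wiki URL and article list are made up:

    import subprocess
    from pypdf import PdfWriter

    WIKI = "http://localhost/wiki/index.php"  # hypothetical private wiki
    articles = ["Physics", "Classical_mechanics", "Thermodynamics"]

    # Step 1: let MediaWiki render each article to XHTML, and let the
    # embedded browser "print" it to a per-article PDF.
    parts = []
    for title in articles:
        out = title + ".pdf"
        url = f"{WIKI}?title={title}&printable=yes"
        subprocess.run(["wkhtmltopdf", url, out], check=True)
        parts.append(out)

    # Step 2: concatenate the per-article PDFs into one book,
    # back to back, with no blank gaps between articles.
    writer = PdfWriter()
    for part in parts:
        writer.append(part)
    with open("book.pdf", "wb") as f:
        writer.write(f)

Note that this naive concatenation is exactly why you lose internal hyperlinks and a clickable index: each article is printed in isolation, so cross-article link targets no longer exist by the time the parts are merged.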
BTW, the best "see also" web links for this would probably be http://meta.wikimedia.org/wiki/Paper_Wikipedia and http://pediapress.com/ . The pediapress example in particular is interesting, as they have a working commercial implementation of some of this stuff, but it has a number of drawbacks:
a) it only seems to be for the English Wikipedia;
b) it was using a dump from July 2006 last time I looked, so it's very out of date, plus errors are frozen in time and cannot be removed or corrected;
c) the preview PDF they give you has "SAMPLE" stamped across every page in red letters;
d) they're professional printers, so they really want to sell printed material, not create PDF files;
e) you can't reorder articles - everything is in alphabetical order only;
f) you can't include partial content, such as sections;
g) you can't pull content from non-local wikis;
h) there's no way to collaborate on books, or share with others the stuff you have worked on but not yet finished.
That said, their site is still kinda neat, and certainly worth looking at to see what works well and what doesn't work so well.
All the best, Nick.
-----Original Message-----
From: wikitech-l-bounces@wikimedia.org [mailto:wikitech-l-bounces@wikimedia.org] On Behalf Of Erik Moeller
Sent: Friday, 5 January 2007 3:02 PM
To: Wikimedia developers; MediaWiki announcements and site admin list
Subject: [Wikitech-l] RfC: PDF and document management for MediaWiki
I have identified an organization which is willing to spend up to about EUR 10,000 on adding support for exporting MediaWiki pages as PDF files, and improving document management for documents consisting of multiple pages.
My current thinking is that the functionality implemented, as a minimum, would be as follows:
a) Using an extension, integrate a "PDF link" on any wiki page which would call an external library like HTMLDOC on a single wiki page.
b) Support filters on the rendered HTML (replacing image thumbnails with high-resolution images, filtering content by regular expression, etc.), and revision filters (export the last revision edited by a user on whitelist Y, or the revision approximating currentdate-Z).
c) Create a "PDF basket" UI which makes it possible to compile a PDF from multiple pages easily (and rearrange the pages in a hierarchy). The resulting structures could potentially also be stored as wikitext, using a new <structure> extension tag, so that they can be used both by individuals compiling PDFs for personal use, and by groups collaborating on complex documents (a strawman example follows below).
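Purely as a strawman, since the tag's syntax is undefined at this point, a stored <structure> page might look something like the following. Every detail here is a guess, reusing ordinary wikitext list nesting and section links:

    <structure>
    * [[Physics]]
    ** [[Classical mechanics]]
    ** [[Thermodynamics#Laws of thermodynamics|Laws of thermodynamics]]
    * [[Quantum mechanics]]
    </structure>

The nesting would define the chapter hierarchy, and section links would pull in partial pages.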
Possibly some budget could also be allocated for improving the external PDF library used, especially if we can allocate additional funds for this project.
I'd like to request comments on this approach, specifically:
- Besides HTMLDOC, do you know a good (X)HTML-to-PDF library which could be used for this purpose?
- Within this budget, do you believe an alternative approach which utilizes an intermediate format is viable (e.g. wiki-to-DocBook-to-PDF), given the complexity of the MediaWiki syntax, its various extensions, and the need to keep up with parser changes?
- If you are a developer, would you be interested in working on this project, and available to do so? (If so, please contact me privately.)
Any other comments would also be appreciated.
Peace & Love, Erik
DISCLAIMER: This message does not represent an official position of the Wikimedia Foundation or its Board of Trustees.