Note: I cross-posted this to several lists, because I think this is of interest to many; please reply on wikitech-l only.
A long, long time ago, I started writing a PHP script to convert MediaWiki markup into XML. I believe it is now feature-complete and relatively reliable. It can process not only a single wiki text but also a list of articles, fetching the text from any MediaWiki-based site online. It uses the same mechanism to resolve templates.
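For those curious how fetching the text from a live site works in principle, here is a minimal sketch (not the actual wiki2xml code; the function and variable names are my own invention) that pulls raw wikitext for a list of titles via index.php?action=raw:

<?php
// Minimal sketch (not the actual wiki2xml code): fetch raw wikitext for a
// list of article titles from a MediaWiki site via index.php?action=raw.
// $base_url and get_wikitext() are illustrative names only.

function get_wikitext($base_url, $title) {
    $url = $base_url . '/index.php?action=raw&title=' . urlencode($title);
    $text = file_get_contents($url);        // requires allow_url_fopen
    return $text === false ? '' : $text;
}

$base_url = 'http://en.wikipedia.org/w';    // any MediaWiki-based site
$titles   = array('Biology', 'Jimmy Wales');

foreach ($titles as $title) {
    $wikitext = get_wikitext($base_url, $title);
    // ... hand $wikitext to the wiki-markup-to-XML converter here ...
    echo "$title: " . strlen($wikitext) . " bytes of wikitext\n";
}
?>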
The generated XML can now be converted into other formats. For demonstration [1], I offer "plain text" and DocBook XML.
What I cannot demonstrate (due to limitations of my hosting service) is the subsequent conversion of the DocBook XML to HTML or PDF. However, it is quite easy to set up an automatic conversion locally if you have the necessary DocBook stylesheets and tools installed.
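As a rough sketch of what such a local setup might look like (assuming xsltproc, the standard DocBook XSL stylesheets and Apache FOP are installed; all file names and paths below are placeholders, not part of wiki2xml):

<?php
// Rough sketch of a local DocBook-to-PDF pipeline. Assumes xsltproc, the
// DocBook XSL stylesheets and Apache FOP are installed; paths are placeholders.

$docbook_xml = 'Biology_topics.docbook.xml';   // output of wiki2xml
$fo_file     = 'Biology_topics.fo';
$pdf_file    = 'Biology_topics.pdf';
$fo_style    = '/usr/share/xml/docbook/stylesheet/nwalsh/fo/docbook.xsl';

// DocBook XML -> XSL-FO
shell_exec('xsltproc -o ' . escapeshellarg($fo_file) . ' ' .
           escapeshellarg($fo_style) . ' ' . escapeshellarg($docbook_xml));

// XSL-FO -> PDF
shell_exec('fop -fo ' . escapeshellarg($fo_file) .
           ' -pdf ' . escapeshellarg($pdf_file));

echo "Wrote $pdf_file\n";
?>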
As an example, I have generated a PDF [2] by
1. entering the titles of the articles I want to have,
2. choosing "DocBook PDF" as the output format,
3. clicking "Convert",
4. waiting for the PDF to open.
Really, that easy! :-)
I am well aware of some shortcomings of the example PDF; however, most of them (no left margin, gigantic tables, misshapen images) are flaws of DocBook, or of the default stylesheets I use. I'm not really familiar with DocBook and hope for help from people who are.
While the converter seems to work pretty well, I'm sure there are lots of fun bugs to find. If you do find a page that breaks, please mail me the title so I can track down the bug, or even better, fix it yourself! The code is in CVS, in the "wiki2xml" module, "php" directory (ignore the old C code in the main directory ;-)
A word about speed: yes, the process of creating a PDF takes some time. However, most of it is DocBook at work, plus the time needed to load articles and templates. Converting the example from wiki markup to XML to DocBook XML to PDF takes 2 minutes 20 seconds in total, but the actual wiki-to-XML conversion is done in just 8 seconds.
Apart from bug fixing, my next priority is ODT (OpenOffice) output. Also, I would like to extend Special:Export in MediaWiki so it can return a list of authors, which could then be added automagically to all converted files.
Awaiting your feedback, Magnus
[1] http://magnusmanske.de/wiki2xml/w2x.php
[2] http://magnusmanske.de/wiki2xml/Biology_topics.pdf (3.7 MB!)
> A long, long time ago, I started writing a PHP script to convert MediaWiki markup into XML. I believe it is now feature-complete and relatively reliable. It can process not only a single wiki text but also a list of articles, fetching the text from any MediaWiki-based site online. It uses the same mechanism to resolve templates.
Great thing!
Question: will it generate internal links? I.e., if I select [[wikipedia]] and [[Jimmy Wales]] as articles, will the links between them be "internal", or point back to the site? It'd be great to have the resulting XML as self-contained as possible.
*dreams of having, on MediaWiki, a way to select articles efficiently and export them directly*
Nicolas
Nicolas Weeger wrote:
>> A long, long time ago, I started writing a PHP script to convert MediaWiki markup into XML. I believe it is now feature-complete and relatively reliable. It can process not only a single wiki text but also a list of articles, fetching the text from any MediaWiki-based site online. It uses the same mechanism to resolve templates.
> Great thing!
> Question: will it generate internal links? I.e., if I select [[wikipedia]] and [[Jimmy Wales]] as articles, will the links between them be "internal", or point back to the site? It'd be great to have the resulting XML as self-contained as possible.
Short answer: it uses self-contained links and ignores all others.
Long answer: The XML that is generated in the first step doesn't care about link targets at all. The actual output function in the second step (e.g. to DocBook) handles that, so it can be adapted to the format.
Maybe I should turn all non-internal links into external ones (pointing back to the wiki). However, that would also create links to non-existing articles (or I'd have to check each of them across the web, resulting in bandwidth hell ;-)
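To illustrate the second-step idea (this is only a sketch of the approach, not the actual wiki2xml output code; all names are made up): the output function sees each link with its target and can decide, per format, whether to render it as an internal DocBook link, rewrite it as an external URL back to the wiki, or drop it.

<?php
// Sketch of the idea only (not the actual wiki2xml code; names are made up):
// the second-step output function decides per link how to render it.

function render_link($target, $label, $exported_titles, $base_url) {
    if (in_array($target, $exported_titles)) {
        // Target is part of this export: keep it as an internal DocBook link.
        $id = strtr($target, ' ', '_');
        return '<link linkend="' . htmlspecialchars($id) . '">' .
               htmlspecialchars($label) . '</link>';
    }
    // Not part of the export: drop the link, keep the label text ...
    return htmlspecialchars($label);
    // ... or, alternatively, turn it into an external link back to the wiki:
    // return '<ulink url="' . htmlspecialchars($base_url . '/' . urlencode($target)) . '">'
    //        . htmlspecialchars($label) . '</ulink>';
}

// Example: only "Biology" is part of the export.
$exported = array('Biology');
echo render_link('Biology', 'Biology', $exported, 'http://en.wikipedia.org/wiki') . "\n";
echo render_link('Jimmy Wales', 'Jimmy Wales', $exported, 'http://en.wikipedia.org/wiki') . "\n";
?>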
> *dreams of having, on MediaWiki, a way to select articles efficiently and export them directly*
Soon. Maybe not on Wikipedia (performance reasons), but otherwise - soon.
BTW, I have brought the parsing times way down by now: [[Biology]] is converted in 0.5 seconds, not counting the time it takes to load the sources of articles/templates from the web. This might become a MediaWiki parser replacement after all...
Magnus