Note: I cross-posted this to several lists, because I think this is of interest to many; please reply on wikitech-l only.
A long, long time ago, I started writing a PHP script to convert MediaWiki markup into XML. I believe it is now feature-complete and relatively reliable. It can process not only a single wiki text but also a list of articles, taking the text from any MediaWiki-based site online. It uses the same method to replace templates.
The generated XML can now be converted into other formats. For demonstration [1], I offer "plain text" and DocBook XML.
What I cannot demonstrate (due to limitations of my hosting service) is the subsequent conversion of the DocBook XML to HTML or PDF. However, it is quite easy to set up an automatic conversion locally if you have the necessary DocBook files installed.
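To give an idea of what such a local setup involves, here is a rough sketch (this is not part of wiki2xml; it assumes xsltproc, the DocBook XSL stylesheets and Apache FOP are installed, and all file names and the stylesheet path are only examples):

<?php
// Minimal sketch: DocBook XML (as produced by wiki2xml) -> PDF.
// Assumes xsltproc, the DocBook XSL stylesheets and Apache FOP are installed.
$docbook  = 'article.docbook.xml';   // example input file
$fo_file  = 'article.fo';
$pdf_file = 'article.pdf';
$xsl      = '/usr/share/xml/docbook/stylesheet/docbook-xsl/fo/docbook.xsl'; // example path

// DocBook XML -> XSL-FO, using the standard DocBook FO stylesheet
system('xsltproc -o ' . escapeshellarg($fo_file) . ' ' .
       escapeshellarg($xsl) . ' ' . escapeshellarg($docbook));

// XSL-FO -> PDF via Apache FOP
system('fop ' . escapeshellarg($fo_file) . ' ' . escapeshellarg($pdf_file));
?>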
As an example, I have generated a PDF [2] by
1. Entering the titles of the articles I want to include
2. Choosing "DocBook PDF" as the output format
3. Clicking "Convert"
4. Waiting for the PDF to open
Really, that easy! :-)
I am well aware of some shortcomings of the example PDF; however, most of them (no left margin, gigantic tables, misshapen images) are flaws of DocBook, or of the default stylesheets I use. I'm not really familiar with DocBook and hope for help from people who are.
While the converter seems to work pretty well, I'm sure there are lots of fun bugs to find. If you find a page that breaks, please mail me the title so I can track down the bug, or, even better, fix it yourself! The code is in CVS, in the "wiki2xml" module, "php" directory (ignore the old C code in the main directory ;-)
A word about speed: yes, creating a PDF takes some time. However, most of it is spent in the DocBook processing and, of course, in loading the articles and templates. Converting the example from wiki markup to XML to DocBook XML to PDF takes 2 minutes 20 seconds in total, but the actual wiki-to-XML conversion takes just 8 seconds.
Apart from bug fixing, my next priority is ODT (OpenOffice) format output. Also, I would like to extend Special:Export in MediaWiki so it can return a list of authors, which can then be added automagically to all converted files.
Awaiting your feedback, Magnus
[1] http://magnusmanske.de/wiki2xml/w2x.php
[2] http://magnusmanske.de/wiki2xml/Biology_topics.pdf (3.7 MB!)
On 3/22/06, Magnus Manske magnus.manske@web.de wrote:
A long, long time ago, I started writing a PHP script to convert MediaWiki markup into XML. I believe it is now feature-complete and relatively reliable. It can process not only a single wiki text but also a list of articles, taking the text from any MediaWiki-based site online. It uses the same method to replace templates.
This sounds brilliant. However, I must ask a stupid question: if it can take text from *any* MediaWiki-based site (i.e., one not running the PHP script), then why is it written in PHP? Could you explain how you deploy it?
Steve
Steve Bennett schrieb:
On 3/22/06, Magnus Manske magnus.manske@web.de wrote:
A long, long time ago, I started writing a PHP script to convert MediaWiki markup into XML. I believe it is now feature-complete and relatively reliable. It can process not only a single wiki text but also a list of articles, taking the text from any MediaWiki-based site online. It uses the same method to replace templates.
This sounds brilliant. However, I must ask a stupid question: if it can take text from *any* MediaWiki-based site (i.e., one not running the PHP script), then why is it written in PHP?
It is currently a standalone application. I wrote it in PHP so it can be easily integrated into MediaWiki as an extension one day.
Could you explain how you deploy it?
I run it on my local Apache/PHP environment. It accesses the MediaWiki installation I want through the web, retrieving text via "&action=raw". I plan to change that to access via "Special:Export" soon.
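In case it helps to picture it, the raw text retrieval looks roughly like this (a simplified sketch; the function name and base URL are only examples, not actual wiki2xml code):

<?php
// Sketch only: fetch raw wikitext via "&action=raw", the way the converter does.
// Function name and base URL are illustrative, not taken from wiki2xml itself.
function fetch_raw_wikitext($base_url, $title) {
    $url = $base_url . '/index.php?title=' . urlencode($title) . '&action=raw';
    return file_get_contents($url);   // needs allow_url_fopen enabled
}

$text = fetch_raw_wikitext('http://en.wikipedia.org/w', 'Biology');
?>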
Magnus
--- Magnus Manske magnus.manske@web.de wrote:
A long, long time ago, I started writing a PHP script to convert MediaWiki markup into XML.
I am looking forward to this feature. I use and promote the use of wikis in corporate environments, and one of the problems is that there is no easy way to publish a wiki to paper.
As you may know, corporations love paper copies.
Ultimately, the paper version won't be used very much, but having the ability to produce it would go a long way towards the suits saying aye.
Thanks again, Magnus, for your hard work.
Chris Mahan
818.943.1850 cell
chris_mahan@yahoo.com
chris.mahan@gmail.com
http://www.christophermahan.com/