Jay R. Ashworth wrote:
On Tue, May 09, 2006 at 05:26:12PM +0100, Ben Francis wrote:
Jay R. Ashworth wrote:
I've badly wanted to see something like this for some time, and would be glad (with 20 years system analyst experience :-) to kibitz on the design, if you like.
That's cool, thanks.
My personal target was mostly being able to extract a partial tree from a running MW install, and dump it into a DocBook source file,
When you say a "partial" tree, how would you define the boundaries of said tree?
Well, this is where you diverge from the translation part to the extraction part, the part I don't think Magnus' tool deals with yet (though I haven't looked deeply into it).
BTW, the tool has now moved to http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php
The current version can also work as an extension within a MediaWiki installation, where it appears as a Special Page. Just put some link/button into your MediaWiki skin, and you have your export :-)
There are 2 issues that I can see, off hand:
- Limiting what you extract (and handling it properly, vis-à-vis, say,
"actual pages" vs "glossary items", and suchlike)
and
- Properly handling cross references
When my tool converts all given articles into XML, it keeps a record of the converted article names. IIRC, the DocBook export then creates internal links to articles that have been converted. This works nicely in DocBook-based HTML and PDFs, for example.
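In rough terms the record-keeping amounts to something like this (a Python sketch only -- the real wiki2xml code is PHP, and make_id()/render_link() are made-up names):

# Keep the set of article titles that were actually converted, then emit a
# DocBook <link linkend="..."> only when the target is part of the export;
# otherwise fall back to plain text. DocBook resolves the link to a
# hyperlink or a page reference depending on the output format.

def make_id(title):
    # DocBook ids must be valid XML names; crude normalisation
    return "art_" + "".join(c if c.isalnum() else "_" for c in title)

def render_link(target, label, converted_titles):
    if target in converted_titles:
        return '<link linkend="%s">%s</link>' % (make_id(target), label)
    return label  # target was not exported: plain text, no dangling xref

converted = {"Main Page", "Glossary"}
print(render_link("Glossary", "see the glossary", converted))    # internal link
print(render_link("Some Other Page", "elsewhere", converted))    # plain text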
The first can probably be handled by category tagging and some configuration files; the latter will likely require some in-depth knowledge of how DocBook handles such things, since you can't do hyperlinks on many of DocBook's target formats (like, um, paper :-), and you can't bind traditional cross-references until you have real page numbers.
Just put an arrow in front of a reference ;-)
On the specific issue of trimming the tree, part of it is going to have to be discipline on the part of the maintainers of the wiki not to introduce loops -- it will likely be necessary to have a pre-pass switch on the driver engine that extracts and displays the "table of contents" in a raw unnumbered mode (in addition, of course, to one that generates a formattable ToC) so you can see if it ever ends.
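Such a pre-pass could be as simple as a depth-first walk that prints the tree raw and flags back-edges; a Python sketch, with get_links() standing in for however the driver would actually ask MediaWiki for a page's links:

# Walk the link tree from a starting page, print it raw and unnumbered,
# and flag any page that links back to one of its ancestors (a loop).

def dump_raw_toc(page, get_links, ancestors=(), seen=None):
    if seen is None:
        seen = set()
    print("  " * len(ancestors) + page)
    if page in ancestors:
        print("  " * (len(ancestors) + 1) + "^-- loop back to an ancestor")
        return
    if page in seen:
        return              # already expanded elsewhere in the tree
    seen.add(page)
    for child in get_links(page):
        dump_raw_toc(child, get_links, ancestors + (page,), seen)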
It might be best to have a starting page (Main Page), use every link on there as level one, every link on those as level two, etc. Or use CatScan (on the toolserver as well, somewhere ;-) to get a category tree.
A simple way to pass this to my script would then be putting spaces in front of the article names - one space per depth. Leading spaces should be filtered out right now, but I could make them part of the XML output.
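A sketch of building such a list from a category tree (Python; get_subcategories() and get_articles() are placeholders for whatever CatScan or the API would provide, and the one-space-per-level convention is just the one proposed above):

# Walk a category tree to a limited depth and emit one article name per
# line, indented with one space per level, ready to paste into wiki2xml.

def article_list(category, get_subcategories, get_articles,
                 max_depth=2, depth=0, seen=None):
    if seen is None:
        seen = set()
    lines = []
    for title in get_articles(category):
        if title not in seen:
            seen.add(title)
            lines.append(" " * depth + title)
    if depth < max_depth:
        for sub in get_subcategories(category):
            lines.extend(article_list(sub, get_subcategories, get_articles,
                                      max_depth, depth + 1, seen))
    return lines

# print("\n".join(article_list("Category:Manual", get_subcats, get_arts)))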
In case you didn't know, my script can do a full conversion, e.g. XML->DocBook->PDF, if configured properly. I have a local test setup on a Window$ machine with some out-of-the-box tools for the last step; I just didn't bother to set that up on my test site, since I don't really have experience with DocBook...
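For what it's worth, the last leg usually amounts to something like this (a sketch using xsltproc and Apache FOP as stand-ins for whatever out-of-the-box tools are installed; paths and file names are illustrative):

# DocBook XML -> XSL-FO -> PDF, using the DocBook XSL stylesheets with
# xsltproc and Apache FOP.

import subprocess

def docbook_to_pdf(docbook_xml, fo_stylesheet, out_pdf):
    # DocBook -> XSL-FO
    subprocess.run(["xsltproc", "-o", "book.fo", fo_stylesheet, docbook_xml],
                   check=True)
    # XSL-FO -> PDF
    subprocess.run(["fop", "-fo", "book.fo", "-pdf", out_pdf], check=True)

docbook_to_pdf("export.xml", "docbook-xsl/fo/docbook.xsl", "export.pdf")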
Magnus