SoC: MediaWiki Export

List overview All Threads
Download

newer

older

Is Wikipedia ok?

Why not save diffs instead of...

Ben Francis

9 May 2006 9 May '06

7:46 a.m.

Hi,

Since the MediaWiki Client API is no longer a Google Summer of Code project, I've managed to find another proposal which I find really interesting and submitted my application, MediaWiki Export.

Application here: http://ideas.hippygeek.co.uk/wiki/WikimediaSoCApplication More details here: http://ideas.hippygeek.co.uk/wiki/MediaWikiExport

I'm interested in how much work has gone into this already because it would be such a useful feature (both the exporting and the native DocBook storage). Actually, once the API gets implemented there are enormous possibilities for using MediaWiki as a document management tool and calling the API from different internal systems.

An organisation could write documents using a desktop application but use MediaWiki for the back end storage. They could then log in remotely, access and edit the documents from anywhere via a web interface or other application.

Imagine MediaWiki for help systems where the help appears inside a desktop or web application but is stored and collaboratively edited using a MediaWiki installation.

If the XHTML wiki web pages only have to be one representation of the data, the possibilties are endless!

The PHP tool written by Magnus Manske (http://magnusmanske.de/wiki2xml/w2x.php) looks good

Best Wishes

Ben

-- Ben "tola" Francis http://hippygeek.co.uk

Show replies by date

Jay R. Ashworth

9 May 9 May

6:57 a.m.

On Tue, May 09, 2006 at 03:46:00PM +0100, Ben Francis wrote:

...

I'm interested in how much work has gone into this already because it would be such a useful feature (both the exporting and the native DocBook storage). Actually, once the API gets implemented there are enormous possibilities for using MediaWiki as a document management tool and calling the API from different internal systems.

An organisation could write documents using a desktop application but use MediaWiki for the back end storage. They could then log in remotely, access and edit the documents from anywhere via a web interface or other application.

Imagine MediaWiki for help systems where the help appears inside a desktop or web application but is stored and collaboratively edited using a MediaWiki installation.

If the XHTML wiki web pages only have to be one representation of the data, the possibilties are endless!

The PHP tool written by Magnus Manske (http://magnusmanske.de/wiki2xml/w2x.php) looks good

Yeah; at the moment, Magnus seems like the go-to guy on that. I've badly wanted to see something like this for some time, and would be glad (with 20 years system analyst experience :-) to kibitz on the design, if you like.

My personal target was mostly being able to extract a partial tree from a running MW install, and dump it into a DocBook source file, processing xrefs and the like in some useful fashion, so that flat-file documentation can be extracted from a MediaWiki used to maintain it.

(Read: I talked the MythTV people into converting from Moin, and this was one of the selling points. :-)

That usage, of course, implies a few extra requirements that the general case wouldn't require, but I think it's one of the most useful targets for such a processing chain.

Cheers, -- jra

-- Jay R. Ashworth jra@baylink.com Designer Baylink RFC 2100 Ashworth & Associates The Things I Think '87 e24 St Petersburg FL USA http://baylink.pitas.com +1 727 647 1274 A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing on Usenet and in e-mail?

Ben Francis

9:26 a.m.

Jay R. Ashworth wrote:

...

I've badly wanted to see something like this for some time, and would be glad (with 20 years system analyst experience :-) to kibitz on the design, if you like.

That's cool, thanks.

...

My personal target was mostly being able to extract a partial tree from a running MW install, and dump it into a DocBook source file,

When you say a "partial" tree, how would you define the boundaries of said tree?

Ben -- Ben "tola" Francis http://hippygeek.co.uk

Jay R. Ashworth

10:15 a.m.

On Tue, May 09, 2006 at 05:26:12PM +0100, Ben Francis wrote:

...

Jay R. Ashworth wrote:

...
I've badly wanted to see something like this for some time, and would be glad (with 20 years system analyst experience :-) to kibitz on the design, if you like.

That's cool, thanks.

...
My personal target was mostly being able to extract a partial tree from a running MW install, and dump it into a DocBook source file,

When you say a "partial" tree, how would you define the boundaries of said tree?

Well, this is where you diverge from the translation part to the extraction part, the part I don't think Magnus' tool deals with yet (though I haven't looked deeply into it).

There are 2 issues that I can see, off hand:

1) Limiting what you extract (and handling it properly, viz a viz, say, "actual pages" vs "glossary items", and suchlike)

and

2) Properly handling cross references

The first can probably be handled by category tagging and some configuration files; the latter will likely require some indepth knowledge of how DocBook handles such things, since you can't do hyperlinks on many of DocBook's target formats (like, um, paper :-), and you can't bind traditional cross-references until you have real page numbers.

On the specific issue of trimming the tree, part of it is going to have to be discipline on the part of the maintainers of the wiki not to introduce loops -- it will likely be necessary to have a pre-pass switch on the driver engine that extracts and displays the "table of contents" in a raw unnumbered mode (in addition, of course, to one that generates a formattable ToC) so you can see if it ever ends.

But ignoring everything except the content of the actual page (what you get from &action=raw, which was going to be my approach) is probably the starting point, of course.

Cheers, -- jra

Magnus Manske

10 May 10 May

3:01 a.m.

Jay R. Ashworth schrieb:

...

On Tue, May 09, 2006 at 05:26:12PM +0100, Ben Francis wrote:

...
Jay R. Ashworth wrote:

...
I've badly wanted to see something like this for some time, and would be glad (with 20 years system analyst experience :-) to kibitz on the design, if you like.

That's cool, thanks.

...
My personal target was mostly being able to extract a partial tree from a running MW install, and dump it into a DocBook source file,

When you say a "partial" tree, how would you define the boundaries of said tree?

Well, this is where you diverge from the translation part to the extraction part, the part I don't think Magnus' tool deals with yet (though I haven't looked deeply into it).

BTW, the tool has now moved to http://tools.wikimedia.de/~magnus/wiki2xml/w2x.php

The current version can also work as an extension within a MediaWiki installation, where it appears as a Special Page. Just put some link/button into your MediaWiki skin, and you have your export :-)

...

There are 2 issues that I can see, off hand:

Limiting what you extract (and handling it properly, viz a

viz, say, "actual pages" vs "glossary items", and suchlike)

and

Properly handling cross references

When my tool coverts all given articles into XML, it keeps a record of the converted article names. IIRC, the DocBook export then creates internal links to articles that have been converted. This works nicely in DocBook-based HTML and PDFs, for example.

...

The first can probably be handled by category tagging and some configuration files; the latter will likely require some indepth knowledge of how DocBook handles such things, since you can't do hyperlinks on many of DocBook's target formats (like, um, paper :-), and you can't bind traditional cross-references until you have real page numbers.

Just put an arrow in front of a reference ;-)

...

On the specific issue of trimming the tree, part of it is going to have to be discipline on the part of the maintainers of the wiki not to introduce loops -- it will likely be necessary to have a pre-pass switch on the driver engine that extracts and displays the "table of contents" in a raw unnumbered mode (in addition, of course, to one that generates a formattable ToC) so you can see if it ever ends.

It might be best to have a starting page (Main Page), use every link on there as level one, every link on these as level two etc. Or use CatScan (on the toolserver as well, somewhere;-) to get a category tree.

A simple way to pass this to my script would then be putting spaces in front of the article names - one space per depth. Leading spaces Should be filtered out right now, but I could make them part of the XML output.

In case you didn't know, my script can do a full conversion, e.g. XML->DocBook->PDF, if configured properly. I have a local test setup on a Window$ machine with some out-of-the-box tools to do the last step, just didn't bother to do that on my test site, since I don't really have experience with DocBook...

Magnus

Denny Vrandecic

3:24 a.m.

New subject: Developing an extension in the Mediawiki SVN?

Hello all,

sorry if this Mail should reach you twice. My connection is somewhat less then spectacular these days, and so sometimes I am not sure if a Mail got through or not.

I am [[en:Denny]], one of the developers of the Semantic Mediawiki extension to the Mediawiki code. Right now, we are using the SourceForge-CVS service for the development of our code. As you know, this sucks a whole lot.

We would like to ask kindly if we may move from the SF CVS to the MediaWiki SVN and continue to develop there, as a module. If there are any requirements we have to fulfill, or anything else we have to do, please tell us so. The license is the same as Mediawiki, so I guess there should be no problem with that part. 'twould be great if this were possible?

Thanks in advance, Denny Vrandecic

Tels

11 May 11 May

1:23 a.m.

Moin,

On Tuesday 09 May 2006 16:46, Ben Francis wrote:

...

Hi,

Since the MediaWiki Client API is no longer a Google Summer of Code

[snip]

Just for the record, I am also interested (and working) on the wiki2xml project. The goal is to extract a "tree" from:

http://bloodgate.com/wiki/Wiki-Presentations

automatically, e.g. get all pages and convert them in one go into an OpenOffice (or whatever) document.

At the moment I am still struggling with finding my way around Magnus' code, but eventually we get there :)

The current idea is to use a template which lists all the pages as a source, but of course it would be equally possible to generate the list of pages to extract from either a category, or spidering it from a start page or whatever. Producing the list of articles to be extracted is really a seperate issue from converting one or more articles to another format :)

Best wishes,

Tels

-- Signed on Thu May 11 10:21:14 2006 with key 0x93B84C15. Visit my photo gallery at http://bloodgate.com/photos/ PGP key on http://bloodgate.com/tels.asc or per email. "Den wahren Wert dieser Software werden vermutlich nur Fach Läute und Firmen erkennen." -- "So isst es. Ein gewißer Standart muss schon gewart beiben!" -- Kabe (http://tinyurl.com/3kucx)

Jay R. Ashworth

8:20 a.m.

On Thu, May 11, 2006 at 10:23:58AM +0200, Tels wrote:

...

Just for the record, I am also interested (and working) on the wiki2xml project. The goal is to extract a "tree" from:

http://bloodgate.com/wiki/Wiki-Presentations

automatically, e.g. get all pages and convert them in one go into an OpenOffice (or whatever) document.

At the moment I am still struggling with finding my way around Magnus' code, but eventually we get there :)

The current idea is to use a template which lists all the pages as a source, but of course it would be equally possible to generate the list of pages to extract from either a category, or spidering it from a start page or whatever. Producing the list of articles to be extracted is really a seperate issue from converting one or more articles to another format :)

Spidering it was my preferred approach, yes, though something someone alluded to which amounted to "spider out the page names into a file (with visible indention) and then run over that file, as amended by a human" rather then automatically spidering directly into the conversion code, is likely a better approach. Indeed, this approach would allow you to select a section numbering protocol and see how it would look before doing The Big Run.

Cheers, -- jra

6792

Age (days ago)

6794

Last active (days ago)

wikitech-l@lists.wikimedia.org

7 comments

5 participants

tags (0)

participants (5)

Ben Francis
Denny Vrandecic
Jay R. Ashworth
Magnus Manske
Tels