Steve Bennett wrote:
On 1/23/08, Tim Starling tstarling@wikimedia.org wrote:
The new preprocessor has an intermediate XML representation for pages before template inclusion, and it would be possible to store it. There's a RECOVER_ORIG mode that allows the original wikitext to be recovered from the XML. The problems with using it as a storage format are:
- It's useless as an interchange format since it still depends on thousands of lines of MediaWiki code to generate HTML from it.
- The XML format, and the details of the transformation, are subject to change.
- Transformation from wikitext to preprocessed XML is relatively fast, and will hopefully get faster with further development, so it can be generated on demand for any application that needs it.
Hmm, ok. I'm having a bit of trouble picturing this XML format that includes "preprocessed" wikitext but doesn't have templates substituted. Do you mean that at this stage you've parsed the structure of template and parser function calls, but haven't yet substituted in the result?
Yes.
I think I agree that such an early stage of processing is not the place to generate an exchange format.
However, the XHTML level seems too late: instead of a neat "image" node, you'd end up with all the DIV tags used to actually display the thing in MediaWiki - as opposed to being an abstract representation.
Anyway, if I have understood the situation, writing an export of an interchange format would just be a lot of work, with no special benefit for us, to solve a problem there is not currently any great demand for. I think.
I don't think it's a lot of work, I think it's plain impossible. I think you should specify your goals and try to come up with a method, rather than specifying a method and hoping it will meet some goals.
In another post:
You've suggested that the XML generated by MediaWiki at 2 is no good as an interchange format, and that 5 is suitable. I was (am?) just wondering about the benefits of splitting the XHTML-generating parser into steps 4 and 5, and making 4 generate an XML interchange format, possibly compatible with wikicreole's.
That may well be possible, but it doesn't meet the goals specified in your original post, or in the PDF. WikiCreole is vastly simpler than MediaWiki wikitext. You can't convert MediaWiki wikitext to WikiCreole, unless you want just about everything to be in extension tags.
What's the WikiCreole equivalent of this?
<span class="plainlinks" style="speech-rate: fast">
[http://en.wikipedia.org/ Wikipedia] </span>
The definition of MediaWiki wikitext inherently references HTML and CSS. It can only be converted to formats with a similar capability set to HTML+CSS.
HTML+CSS is a well-specified format which aims to support output to all media types. It separates structure from presentation and provides for semantic annotation. There is a lot more content available in HTML+CSS than in any of the wikitext markup languages. That's why I wish research efforts were focused on analysis and conversion of this common language. Why would you want to convert directly from one restricted subset of HTML to an even more restricted subset? Why not improve annotation of MediaWiki's HTML output to make it more reusable, and produce an HTML to MediaWiki wikitext converter?
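To make the last suggestion concrete, here is a rough sketch of what the core of an HTML to wikitext converter might look like. It is an illustration only, not existing MediaWiki code: the class name HtmlToWikitext, the function html_to_wikitext and the choice of supported tags are all assumptions made for this example. It uses Python's standard html.parser module, handles a handful of tags, and does not escape attribute values. Tags that wikitext accepts directly, such as span, are passed through with their class and style attributes intact, which is the point of the example above.

```python
from html.parser import HTMLParser


class HtmlToWikitext(HTMLParser):
    """Hypothetical converter for a tiny subset of HTML (illustration only)."""

    # Tags that wikitext expresses with its own quote markup.
    SIMPLE = {"b": "'''", "strong": "'''", "i": "''", "em": "''"}
    # Tags that wikitext accepts verbatim, attributes included.
    PASSTHROUGH = {"span", "div"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SIMPLE:
            self.out.append(self.SIMPLE[tag])
        elif tag == "a":
            # External link syntax: [url label]
            href = dict(attrs).get("href") or ""
            self.out.append("[" + href + " ")
        elif tag in self.PASSTHROUGH:
            # Keep class/style so no HTML+CSS information is lost.
            # Note: attribute values are not escaped in this sketch.
            rendered = "".join(
                ' %s="%s"' % (name, value)
                for name, value in attrs
                if value is not None
            )
            self.out.append("<%s%s>" % (tag, rendered))
        # Any other tag is dropped; its text content is still emitted.

    def handle_endtag(self, tag):
        if tag in self.SIMPLE:
            self.out.append(self.SIMPLE[tag])
        elif tag == "a":
            self.out.append("]")
        elif tag in self.PASSTHROUGH:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)


def html_to_wikitext(html):
    """Convert an HTML fragment to wikitext using the sketch above."""
    parser = HtmlToWikitext()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    fragment = (
        '<span class="plainlinks" style="speech-rate: fast">'
        '<a href="http://en.wikipedia.org/">Wikipedia</a></span>'
    )
    print(html_to_wikitext(fragment))
```

A real converter would need far more than this (tables, lists, images, templates, escaping), but the sketch shows why the target has to keep HTML+CSS capabilities: the span and its attributes survive the round trip only because wikitext allows them.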
-- Tim Starling