Steve Bennett wrote:
On 1/23/08, Tim Starling <tstarling(a)wikimedia.org> wrote:
The new preprocessor has an intermediate XML representation for pages
before template inclusion, and it would be possible to store it. There's a
RECOVER_ORIG mode that allows the original wikitext to be recovered from
the XML. The problems with using it as a storage format are:
* It's useless as an interchange format since it still depends on
thousands of lines of MediaWiki code to generate HTML from it.
* The XML format, and the details of the transformation, are subject to
change.
* Transformation from wikitext to preprocessed XML is relatively fast, and
will hopefully get faster with further development, so it can be generated
on demand for any application that needs it.
Hmm, ok. I'm having a bit of trouble picturing this XML format that
includes "preprocessed" wikitext but doesn't have templates
substituted. Do you mean that at this stage you've parsed the
structure of template and parser function calls, but haven't yet
substituted in the result?
Yes.
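To make that stage concrete, here is a minimal sketch of the idea: template
calls are recorded as nodes in a tree without being expanded, and the original
wikitext can be rebuilt from the tree. The element names, the recover_orig
function name (borrowed from the RECOVER_ORIG mode mentioned above) and the
restriction to flat, non-nested calls are all assumptions made for this
illustration; this is not MediaWiki's actual XML format or code.

```python
import re
import xml.etree.ElementTree as ET

# Matches a flat (non-nested) template call such as {{name|arg1|arg2}}.
# Real wikitext allows nesting and much more; this is a toy subset.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]*)((?:\|[^{}|]*)*)\}\}")


def to_tree(wikitext):
    """Record template calls as nodes; do not expand them."""
    root = ET.Element("root")
    last = None  # most recent <template> node, which owns the text after it
    pos = 0
    for m in TEMPLATE_RE.finditer(wikitext):
        literal = wikitext[pos:m.start()]
        if last is None:
            root.text = literal
        else:
            last.tail = literal
        last = ET.SubElement(root, "template")
        ET.SubElement(last, "title").text = m.group(1)
        # group(2) is "" or "|a|b"; drop the leading "|" before splitting.
        if m.group(2):
            for arg in m.group(2)[1:].split("|"):
                ET.SubElement(last, "part").text = arg
        pos = m.end()
    rest = wikitext[pos:]
    if last is None:
        root.text = rest
    else:
        last.tail = rest
    return root


def recover_orig(root):
    """Rebuild the original wikitext from the tree (hypothetical helper)."""
    out = [root.text or ""]
    for tpl in root:
        fields = [child.text or "" for child in tpl]  # title, then parts
        out.append("{{" + "|".join(fields) + "}}")
        out.append(tpl.tail or "")
    return "".join(out)


source = "Born on {{birth date|1970|1|1}} in {{flag|Australia}}."
tree = to_tree(source)
assert recover_orig(tree) == source
```

The tree knows that "birth date" is being called with three arguments, but
nothing in it says what the call will produce, which is why everything
downstream still needs the rest of the parser.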
I think I agree that such an early stage of processing is not the
place to generate an exchange format.
However, the XHTML level seems too late: instead of a neat "image"
node, you'd end up with all the DIV tags used to actually display the
thing in MediaWiki - as opposed to being an abstract representation.
Anyway, if I have understood the situation, writing an export of an
interchange format would just be a lot of work, with no special
benefit for us, to solve a problem there is not currently any great
demand for. I think.
I don't think it's a lot of work, I think it's plain impossible. I think
you should specify your goals and try to come up with a method, rather
than specifying a method and hoping it will meet some goals.
In another post:
You've suggested that the XML generated by MediaWiki at 2 is no good
as an interchange format, and that 5 is suitable. I was (am?) just
wondering about the benefits of splitting the XHTML-generating parser
into steps 4 and 5, and making 4 generate an XML interchange format,
possibly compatible with wikicreole's.
That may well be possible, but it doesn't meet the goals specified in your
original post, or in the PDF. WikiCreole is vastly simpler than MediaWiki
wikitext. You can't convert MediaWiki wikitext to WikiCreole, unless you
want just about everything to be in extension tags.
What's the WikiCreole equivalent of this?
<span class="plainlinks" style="speech-rate: fast">[http://en.wikipedia.org/ Wikipedia]</span>
The definition of MediaWiki wikitext inherently references HTML and CSS.
It can only be converted to formats with a similar capability set to
HTML+CSS.
HTML+CSS is a well-specified format which aims to support output to all
media types. It separates structure from presentation and provides for
semantic annotation. There is a lot more content available in HTML+CSS
than in any of the wikitext markup languages. That's why I wish research
efforts were focused on analysis and conversion of this common language.
Why would you want to convert directly from one restricted subset of HTML
to an even more restricted subset? Why not improve annotation of
MediaWiki's HTML output to make it more reusable, and produce an HTML to
MediaWiki wikitext converter?
-- Tim Starling
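
As a rough illustration of the HTML to wikitext direction suggested above, the
following sketch converts a very small subset of HTML using Python's standard
html.parser module. The class name, function name and choice of supported tags
are assumptions made for this example, not an existing MediaWiki tool. Tags it
does not recognise are passed through unchanged, on the basis that wikitext
itself accepts a subset of HTML; a real converter would need a whitelist and
would have to escape wikitext syntax appearing in the text.

```python
from html.parser import HTMLParser


class WikitextConverter(HTMLParser):
    """Convert a tiny, illustrative subset of HTML to MediaWiki wikitext."""

    # Inline tags whose opening and closing markers are the same token.
    SIMPLE = {"b": "'''", "strong": "'''", "i": "''", "em": "''"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SIMPLE:
            self.out.append(self.SIMPLE[tag])
        elif tag == "a":
            # External link syntax: [URL label]
            href = dict(attrs).get("href") or ""
            self.out.append("[" + href + " ")
        elif tag == "p":
            pass  # paragraphs are separated by the blank line added on close
        else:
            # Unknown tag: keep the original markup, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SIMPLE:
            self.out.append(self.SIMPLE[tag])
        elif tag == "a":
            self.out.append("]")
        elif tag == "p":
            self.out.append("\n\n")
        else:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        # Text is copied as-is; wikitext metacharacters are not escaped here.
        self.out.append(data)


def html_to_wikitext(html):
    parser = WikitextConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


print(html_to_wikitext(
    '<span class="plainlinks"><a href="http://en.wikipedia.org/">Wikipedia</a></span>'
))
```

Note that the span and its class attribute survive the conversion only because
they are passed through as HTML, which is the point made above: the target
format needs a capability set similar to HTML+CSS.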