I was just reading this: http://www.riehle.org/wp-content/uploads/2008/01/a5-junghans.pdf
And wondering if there is any desire (let alone plans) to move to a system of storing a different internal representation (e.g., XML) and separating the display logic out. One obvious benefit would be making it easier to produce different outputs without having to write multiple parsers. Are there others? Would Wikipedia benefit from supporting an interchange format?
Just fishing.
Steve
Steve Bennett wrote:
I was just reading this: http://www.riehle.org/wp-content/uploads/2008/01/a5-junghans.pdf
And wondering if there is any desire (let alone plans) to move to a system of storing a different internal representation (e.g., XML) and separating the display logic out. One obvious benefit would be making it easier to produce different outputs without having to write multiple parsers. Are there others? Would Wikipedia benefit from supporting an interchange format?
It's entirely impossible as stated, due to the existence of the preprocessing step. Changing a template or variable may radically change the HTML document tree, generating changes distant from the template invocation.
The new preprocessor has an intermediate XML representation for pages before template inclusion, and it would be possible to store it. There's a RECOVER_ORIG mode that allows the original wikitext to be recovered from the XML. The problems with using it as a storage format are:
* It's useless as an interchange format since it still depends on thousands of lines of MediaWiki code to generate HTML from it.
* The XML format, and the details of the transformation, are subject to change.
* Transformation from wikitext to preprocessed XML is relatively fast, and will hopefully get faster with further development, so it can be generated on demand for any application that needs it.
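The RECOVER_ORIG round trip described above can be sketched with a toy preprocessor. Everything below is invented for illustration; the element names (`template`, `title`, `part`) and structure do not match MediaWiki's actual intermediate XML:

```python
import re
import xml.etree.ElementTree as ET

def preprocess(wikitext):
    """Wrap template invocations like {{name|arg}} in <template> nodes,
    leaving all other text untouched so the original can be recovered."""
    root = ET.Element("root")
    root.text = ""
    last = None
    pos = 0
    for m in re.finditer(r"\{\{([^{}|]+)(?:\|([^{}]*))?\}\}", wikitext):
        leading = wikitext[pos:m.start()]
        if last is None:
            root.text = leading
        else:
            last.tail = leading
        tmpl = ET.SubElement(root, "template")
        ET.SubElement(tmpl, "title").text = m.group(1)
        if m.group(2) is not None:
            ET.SubElement(tmpl, "part").text = m.group(2)
        tmpl.tail = ""
        last = tmpl
        pos = m.end()
    trailing = wikitext[pos:]
    if last is None:
        root.text += trailing
    else:
        last.tail = trailing
    return root

def recover_orig(root):
    """Reverse the transformation (the RECOVER_ORIG idea)."""
    out = [root.text or ""]
    for tmpl in root:
        parts = [tmpl.find("title").text]
        part = tmpl.find("part")
        if part is not None:
            parts.append(part.text or "")
        out.append("{{" + "|".join(parts) + "}}")
        out.append(tmpl.tail or "")
    return "".join(out)

src = "Intro {{cite|author=X}} middle {{stub}} end"
assert recover_orig(preprocess(src)) == src
```

The point is that the XML keeps enough of the source (including surrounding text in `text`/`tail`) that no information is lost before template expansion.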
If you just want an open interchange format for fully preprocessed, template-free wikitext, then MediaWiki already has one. It's called XHTML.
-- Tim Starling
On 1/23/08, Tim Starling tstarling@wikimedia.org wrote:
The new preprocessor has an intermediate XML representation for pages before template inclusion, and it would be possible to store it. There's a RECOVER_ORIG mode that allows the original wikitext to be recovered from the XML. The problems with using it as a storage format are:
- It's useless as an interchange format since it still depends on
thousands of lines of MediaWiki code to generate HTML from it.
- The XML format, and the details of the transformation, are subject to
change.
- Transformation from wikitext to preprocessed XML is relatively fast, and
will hopefully get faster with further development, so it can be generated on demand for any application that needs it.
Hmm, ok. I'm having a bit of trouble picturing this XML format that contains "preprocessed" wikitext but doesn't have templates substituted. Do you mean that at this stage you've parsed the structure of template and parser function calls, but haven't yet substituted in the result?
I think I agree that such an early stage of processing is not the place to generate an exchange format.
However, the XHTML level seems too late: instead of a neat "image" node, you'd end up with all the DIV tags used to actually display the thing in MediaWiki - as opposed to being an abstract representation.
Anyway, if I have understood the situation, writing an exporter for an interchange format would just be a lot of work, with no special benefit for us, to solve a problem for which there is currently no great demand. I think.
Steve
However, the XHTML level seems too late: instead of a neat "image" node, you'd end up with all the DIV tags used to actually display the thing in MediaWiki - as opposed to being an abstract representation.
DIV tags *are* part of the abstract representation - it's the CSS that handles the display.
On 1/25/08, Thomas Dalton thomas.dalton@gmail.com wrote:
However, the XHTML level seems too late: instead of a neat "image" node, you'd end up with all the DIV tags used to actually display the thing in MediaWiki - as opposed to being an abstract representation.
DIV tags *are* part of the abstract representation - it's the CSS that handles the display.
Well, not as abstract as, say, the new WikiCreole interchange format:
<xsd:complexType name="imageType">
  <xsd:sequence>
    <xsd:element name="uri" type="xsd:string"/>
    <xsd:element name="alternative" type="simpletextType" minOccurs="0" maxOccurs="1"/>
  </xsd:sequence>
</xsd:complexType>
There is then an XSLT layer to convert from that to actual XHTML:

<xsl:template match="image">
  <xsl:text disable-output-escaping="yes">&lt;img src="</xsl:text>
  <xsl:value-of select="uri"/>
  <xsl:text disable-output-escaping="yes">"/&gt;</xsl:text>
</xsl:template>
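For what it's worth, here is a rough Python equivalent of that XSLT step, using the node names from the schema snippet above; it is only meant to illustrate the abstract-node-to-XHTML code-generation layer, not to match WikiCreole's real toolchain:

```python
import xml.etree.ElementTree as ET

def image_to_xhtml(image_node):
    """Turn an abstract <image> node into an XHTML <img> element."""
    img = ET.Element("img", src=image_node.findtext("uri", default=""))
    alt = image_node.findtext("alternative")
    if alt is not None:
        img.set("alt", alt)
    return img

doc = ET.fromstring(
    "<image><uri>http://example.org/a.png</uri>"
    "<alternative>An example</alternative></image>"
)
img = image_to_xhtml(doc)
assert img.get("src") == "http://example.org/a.png"
assert img.get("alt") == "An example"
```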
So it looks to me like there are the following layers in a conversion from wikitext to a rendered page:
1. Raw wikitext
2. Pre-processed wikitext before template transclusion
3. Pre-processed wikitext with template transclusion
4. Parsed wikitext into some abstract representation that understands 'bold' and 'image' but doesn't specify display
5. XHTML
6. Visual interpretation of the XHTML as performed by the browser
You've suggested that the XML generated by MediaWiki at 2 is no good as an interchange format, and that 5 is suitable. I was (am?) just wondering about the benefits of splitting the XHTML-generating parser into steps 4 and 5, and making 4 generate an XML interchange format, possibly compatible with wikicreole's.
From a programming perspective it seems nice to have a true *parser*, which focuses on processing input, then a *code generator* (probably written in XSLT) that produces output.
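As a sketch of that parser/code-generator split, with a deliberately tiny grammar (bold only); the abstract node name `bold` is invented:

```python
import re
import xml.etree.ElementTree as ET

def parse(wikitext):
    """Parser: wikitext -> abstract tree (step 4). Knows about 'bold',
    says nothing about how it is displayed."""
    xml = re.sub(r"'''(.+?)'''", r"<bold>\1</bold>", wikitext)
    return ET.fromstring("<doc>" + xml + "</doc>")

def generate(tree):
    """Code generator: abstract tree -> XHTML (step 5)."""
    out = [tree.text or ""]
    for child in tree:
        if child.tag == "bold":
            out.append("<b>" + (child.text or "") + "</b>")
        out.append(child.tail or "")
    return "".join(out)

html = generate(parse("plain '''emphatic''' plain"))
assert html == "plain <b>emphatic</b> plain"
```

A different generator walking the same tree could emit, say, LaTeX or plain text, which is the whole attraction of the split.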
Obviously I'm only talking about doing this in a new parser, if/when that happens. The benefits would be too small to contemplate hacking that into the current parser, I would think?
Steve
Steve Bennett wrote:
On 1/23/08, Tim Starling tstarling@wikimedia.org wrote:
The new preprocessor has an intermediate XML representation for pages before template inclusion, and it would be possible to store it. There's a RECOVER_ORIG mode that allows the original wikitext to be recovered from the XML. The problems with using it as a storage format are:
- It's useless as an interchange format since it still depends on
thousands of lines of MediaWiki code to generate HTML from it.
- The XML format, and the details of the transformation, are subject to
change.
- Transformation from wikitext to preprocessed XML is relatively fast, and
will hopefully get faster with further development, so it can be generated on demand for any application that needs it.
Hmm, ok. I'm having a bit of trouble picturing this XML format that contains "preprocessed" wikitext but doesn't have templates substituted. Do you mean that at this stage you've parsed the structure of template and parser function calls, but haven't yet substituted in the result?
Yes.
I think I agree that such an early stage of processing is not the place to generate an exchange format.
However, the XHTML level seems too late: instead of a neat "image" node, you'd end up with all the DIV tags used to actually display the thing in MediaWiki - as opposed to being an abstract representation.
Anyway, if I have understood the situation, writing an exporter for an interchange format would just be a lot of work, with no special benefit for us, to solve a problem for which there is currently no great demand. I think.
I don't think it's a lot of work, I think it's plain impossible. I think you should specify your goals and try to come up with a method, rather than specifying a method and hoping it will meet some goals.
In another post:
You've suggested that the XML generated by MediaWiki at 2 is no good as an interchange format, and that 5 is suitable. I was (am?) just wondering about the benefits of splitting the XHTML-generating parser into steps 4 and 5, and making 4 generate an XML interchange format, possibly compatible with wikicreole's.
That may well be possible, but it doesn't meet the goals specified in your original post, or in the PDF. WikiCreole is vastly simpler than MediaWiki wikitext. You can't convert MediaWiki wikitext to WikiCreole, unless you want just about everything to be in extension tags.
What's the WikiCreole equivalent of this?
<span class="plainlinks" style="speech-rate: fast">
[http://en.wikipedia.org/ Wikipedia]
</span>
The definition of MediaWiki wikitext inherently references HTML and CSS. It can only be converted to formats with a similar capability set to HTML+CSS.
HTML+CSS is a well-specified format which aims to support output to all media types. It separates structure from presentation and provides for semantic annotation. There is a lot more content available in HTML+CSS than in any of the wikitext markup languages. That's why I wish research efforts were focused on analysis and conversion of this common language. Why would you want to convert directly from one restricted subset of HTML to an even more restricted subset? Why not improve annotation of MediaWiki's HTML output to make it more reusable, and produce an HTML to MediaWiki wikitext converter?
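The annotated-HTML-to-wikitext direction might look something like the following sketch. The `class="external"` annotation is an assumption made here for illustration, not necessarily what MediaWiki actually emits:

```python
import xml.etree.ElementTree as ET

def html_to_wikitext(xhtml):
    """Convert a small, annotated XHTML fragment back to wikitext."""
    root = ET.fromstring("<div>" + xhtml + "</div>")
    out = [root.text or ""]
    for el in root:
        if el.tag == "a" and el.get("class") == "external":
            out.append("[%s %s]" % (el.get("href"), el.text or ""))
        elif el.tag == "b":
            out.append("'''%s'''" % (el.text or ""))
        else:  # unknown node: fall back to its text content
            out.append(el.text or "")
        out.append(el.tail or "")
    return "".join(out)

src = ('See <a class="external" href="http://en.wikipedia.org/">'
       'Wikipedia</a> for <b>more</b>.')
assert html_to_wikitext(src) == \
    "See [http://en.wikipedia.org/ Wikipedia] for '''more'''."
```

The richer and more consistent the annotation on the HTML output, the more constructs such a converter can map back unambiguously.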
-- Tim Starling
On Jan 25, 2008 6:44 AM, Tim Starling tstarling@wikimedia.org wrote:
HTML+CSS is a well-specified format which aims to support output to all media types. It separates structure from presentation and provides for semantic annotation. There is a lot more content available in HTML+CSS than in any of the wikitext markup languages. That's why I wish research efforts were focused on analysis and conversion of this common language. Why would you want to convert directly from one restricted subset of HTML to an even more restricted subset? Why not improve annotation of MediaWiki's HTML output to make it more reusable, and produce an HTML to MediaWiki wikitext converter?
Because for some purposes (e.g., a WYSIWYG editor), conversion both ways needs to be lossless, which it almost certainly won't be. You could do very verbose comments, but those immediately break when the user actually edits something (in the WYSIWYG case). If MediaWiki had been designed from the beginning to internally store almost everything as HTML, it would be at least conceivably doable to have a JavaScript HTML-to-wikitext converter that would be lossless, given appropriate annotation of the HTML (for templates, images, extensions, etc.). With wikitext, that's almost certainly impossible, so you have to just throw something together and hope it works well enough in most cases.
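One way to picture the "appropriate annotation of the HTML" idea: carry the original template source on the rendered HTML (here in an invented `data-wikitext` attribute, not any real MediaWiki convention) so a converter can restore it losslessly even though the rendering itself is lossy:

```python
import xml.etree.ElementTree as ET

def annotate(template_source, rendered_text):
    """Wrap a template's rendered output, keeping its wikitext source."""
    span = ET.Element("span", {"data-wikitext": template_source})
    span.text = rendered_text  # what the editor actually sees
    return span

def restore(node):
    # Prefer the annotation; the rendered text is ignored on save.
    return node.get("data-wikitext", node.text or "")

node = annotate("{{convert|5|km}}", "5 kilometres (3.1 mi)")
assert restore(node) == "{{convert|5|km}}"
```

The fragility Simetrical points out is visible even here: if the user edits the rendered text inside the span, the annotation no longer describes what they see.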
WYSIWYG is really the main motivation I see behind constructing a well-defined format of any kind. It would be nice if we could interoperate a little better with third parties, but I haven't seen any compelling application that needs this, and couldn't just use the HTML with maybe some comments to indicate templates. The compelling utility of interoperability is not with third parties, but with clients -- our own users' web browsers.
Unfortunately, this is probably not ever going to happen. Or at least not for a long, long time.
Simetrical wrote:
On Jan 25, 2008 6:44 AM, Tim Starling tstarling@wikimedia.org wrote:
HTML+CSS is a well-specified format which aims to support output to all media types. It separates structure from presentation and provides for semantic annotation. There is a lot more content available in HTML+CSS than in any of the wikitext markup languages. That's why I wish research efforts were focused on analysis and conversion of this common language. Why would you want to convert directly from one restricted subset of HTML to an even more restricted subset? Why not improve annotation of MediaWiki's HTML output to make it more reusable, and produce an HTML to MediaWiki wikitext converter?
Because for some purposes (e.g., a WYSIWYG editor), conversion both ways needs to be lossless, which it almost certainly won't be.
Compared to conversion to WikiCreole? HTML two-way conversion sounds a lot more plausible to me than WikiCreole two-way conversion.
-- Tim Starling
On Jan 25, 2008 7:55 PM, Tim Starling tstarling@wikimedia.org wrote:
Compared to conversion to WikiCreole? HTML two-way conversion sounds a lot more plausible to me than WikiCreole two-way conversion.
Do you think HTML two-way conversion is actually plausible, though? (I wasn't even considering WikiCreole.)
Simetrical wrote:
On Jan 25, 2008 7:55 PM, Tim Starling tstarling@wikimedia.org wrote:
Compared to conversion to WikiCreole? HTML two-way conversion sounds a lot more plausible to me than WikiCreole two-way conversion.
Do you think HTML two-way conversion is actually plausible, though? (I wasn't even considering WikiCreole.)
Well, maybe without templates, and if you made your goal stability, rather than losslessness, then it would start to get a bit more plausible. By that I mean: the original conversion from arbitrary wikitext to HTML may include some loss, because of the way multiple "bad" syntaxes, and one "good" syntax, are converted to the same target annotated HTML. But conversion from annotated HTML to wikitext would only produce "good" syntax, which would subsequently survive round-trip conversion.
To support templates, the output would have to be very heavily annotated indeed, to the point of including a complete copy of the source wikitext of the page and all templates.
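The stability-rather-than-losslessness idea above can be demonstrated with a toy pair of converters: two source syntaxes render to the same HTML, conversion back always emits the one "good" form, and from then on round trips are fixed points:

```python
import re

def to_html(wikitext):
    """Both '''x''' and literal <b>x</b> end up as <b>x</b> in HTML."""
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", wikitext)

def to_wikitext(html):
    """Conversion back only ever produces the ''' form."""
    return re.sub(r"<b>(.+?)</b>", r"'''\1'''", html)

bad = "some <b>bold</b> text"        # "bad" (HTML-style) syntax
good = to_wikitext(to_html(bad))     # first round trip normalizes it
assert good == "some '''bold''' text"
assert to_wikitext(to_html(good)) == good   # stable thereafter
```

So the first conversion is lossy in the sense that it forgets which of the equivalent syntaxes was used, but every subsequent round trip is the identity.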
That's an interesting idea for the problem in general actually: a client which understands MediaWiki wikitext intimately, and simultaneously edits the visual form (e.g. HTML) and the wikitext as you press keys.
There is, however, a philosophical question of whether decent structural multi-output markup can be produced by a WYSIWYG editor with untrained users.
-- Tim Starling
On Jan 26, 2008 8:03 PM, Tim Starling tstarling@wikimedia.org wrote:
That's an interesting idea for the problem in general actually: a client which understands MediaWiki wikitext intimately, and simultaneously edits the visual form (e.g. HTML) and the wikitext as you press keys.
It might be easier if you could just not allow editing of things like templates (and other complicated things). I guess you could then use a bog-standard client-side HTML editor with only a bit of encoded metadata lurking about the place, and translate it server-side to wikitext. Not rendering templates as HTML in this mode -- e.g., substituting placeholders, or having the raw wikitext clearly set off from the rest of the document somehow -- would undoubtedly be acceptable. In fact, isn't this basically what Wikiwyg does?
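That placeholder approach might be sketched like this: swap template invocations out for opaque markers before handing the page to an HTML editor, then swap them back in server-side. The marker format here is invented:

```python
import re

def extract_templates(wikitext):
    """Replace each {{...}} with an opaque marker; stash the originals."""
    stash = []
    def repl(m):
        stash.append(m.group(0))
        return "\x01TPL%d\x01" % (len(stash) - 1)
    return re.sub(r"\{\{[^{}]*\}\}", repl, wikitext), stash

def restore_templates(text, stash):
    """Put the stashed template source back after editing."""
    return re.sub(r"\x01TPL(\d+)\x01",
                  lambda m: stash[int(m.group(1))], text)

src = "Intro {{infobox|x=1}} body {{stub}}"
editable, stash = extract_templates(src)
assert "{{" not in editable
assert restore_templates(editable, stash) == src
```

As long as the editor treats the markers as opaque, templates survive any amount of surrounding WYSIWYG editing untouched.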
There is, however, a philosophical question of whether decent structural multi-output markup can be produced by a WYSIWYG editor with untrained users.
It should be straightforward to design such an editor. It need only provide semantic markup, is all. There are some editors for various formats that do this, typically billed as WYSIWYM (although I admit to not having used any). A WYSIWYG editor that provides only the markup options offered by basic wikitext now couldn't possibly be less semantic than the current wikitext, could it?
wikitech-l@lists.wikimedia.org