On Jan 25, 2008 6:44 AM, Tim Starling <tstarling@wikimedia.org> wrote:
HTML+CSS is a well-specified format which aims to support output to all media types. It separates structure from presentation and provides for semantic annotation. There is a lot more content available in HTML+CSS than in any of the wikitext markup languages. That's why I wish research efforts were focused on analysis and conversion of this common language. Why would you want to convert directly from one restricted subset of HTML to an even more restricted subset? Why not improve annotation of MediaWiki's HTML output to make it more reusable, and produce an HTML to MediaWiki wikitext converter?
Because for some purposes (e.g., a WYSIWYG editor), conversion both ways needs to be lossless, which it almost certainly won't be. You could embed very verbose comments in the HTML to record the original wikitext, but those immediately break when the user actually edits something (in the WYSIWYG case). If MediaWiki had been designed from the beginning to internally store almost everything as HTML, it would be at least conceivably doable to have a JavaScript HTML-to-wikitext converter that would be lossless, given appropriate annotation of the HTML (for templates, images, extensions, etc.). With wikitext as the stored format, that's almost certainly impossible, so you have to just throw something together and hope it works well enough in most cases.
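To make the comment-annotation idea and its failure mode concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `<!--tpl:...-->` comment scheme, the `html_to_wikitext` name, and the `{{Population|Foo}}` template are invented, and nothing here is something MediaWiki is known to emit. It only handles templates, bold, and italic.

```python
import re

# Hypothetical annotation scheme (an assumption, not real MediaWiki output):
# the renderer wraps each template expansion in a pair of comments, and the
# opening comment carries the original wikitext template call.
TPL = re.compile(r"<!--tpl:(.*?)-->.*?<!--/tpl-->", re.DOTALL)


def html_to_wikitext(html: str) -> str:
    """Recover wikitext from annotated HTML (toy subset only)."""
    # Replace each annotated span with the template call stored in its comment.
    # The rendered text between the comments is discarded.
    text = TPL.sub(lambda m: m.group(1), html)
    # Map the two inline tags this sketch understands back to quote markup.
    text = re.sub(r"</?b>", "'''", text)
    text = re.sub(r"</?i>", "''", text)
    return text


rendered = (
    "<b>Foo</b> is a city. "
    "<!--tpl:{{Population|Foo}}-->It has 12,000 inhabitants.<!--/tpl-->"
)
recovered = html_to_wikitext(rendered)

# Simulate a WYSIWYG user changing the rendered number inside the template span.
edited = rendered.replace("12,000", "13,000")
after_edit = html_to_wikitext(edited)
```

As long as nobody touches the rendered text, the comment lets the template call come back intact. Once the user edits inside the annotated span, this converter silently drops the change, because the comment still describes the old template call and there is no general way to map the edited output back onto the template's arguments.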
WYSIWYG is really the main motivation I see behind constructing a well-defined format of any kind. It would be nice if we could interoperate a little better with third parties, but I haven't seen any compelling application that needs this and couldn't just use the HTML, with maybe some comments to indicate templates. The compelling utility of interoperability is not with third parties, but with clients -- our own users' web browsers.
Unfortunately, this is probably not ever going to happen. Or at least not for a long, long time.