On 11/14/07, Virgil Ierubino <virgil.ierubino@gmail.com> wrote:
I'm assuming our problem is this: currently we "parse" wikitext by immediately converting it, via regex, into XHTML. This is not really "parsing", because parsing usually means the creation of an abstract Document Object Model which is then iterated through to generate XHTML, XML, FooBar or whatever (or so I have learnt). Because we're missing this DOM, Wikitext can't expand beyond being used by the current parser (so we can't do WYSIWYG, etc.). However, there appears to be no way of creating a DOM from Wikitext, because doing so would mean standardising the way syntax converts to output, and any kind of standardisation will cause backwards incompatibility.
Your "DOM" is usually called an AST ("abstract syntax tree"). But yes. However, "backwards incompatibility" is not so much the issue as "sudden, drastic misrendering of existing wikitext".
I do think it's impossible to produce a meaningful traditional parser that could exactly replicate the behaviour of the current parser. But I think it's very possible to produce a good parser that will cover all the most useful cases.
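To make the regex-versus-tree distinction concrete, here is a minimal sketch in Python. It is not how the current parser works, and it assumes a toy subset of wikitext (only ''italic'' and '''bold'''). The names `parse`, `to_html` and `to_plain` are made up for illustration. The point is only that once a tree exists, each output format is a separate walk over it.

```python
import re
from dataclasses import dataclass, field
from html import escape


@dataclass
class Text:
    value: str


@dataclass
class Node:
    kind: str                                  # "root", "bold" or "italic"
    children: list = field(default_factory=list)


# Toy subset: only the two quote markers are recognised (an assumption).
MARKERS = {"'''": "bold", "''": "italic"}


def parse(source):
    """Build a small syntax tree instead of emitting markup directly."""
    root = Node("root")
    stack = [root]
    for token in re.split(r"('''|'')", source):
        if not token:
            continue
        kind = MARKERS.get(token)
        if kind is None:
            stack[-1].children.append(Text(token))
        elif stack[-1].kind == kind:
            stack.pop()                        # marker closes the open node
        else:
            node = Node(kind)                  # marker opens a new node
            stack[-1].children.append(node)
            stack.append(node)
    return root                                # unclosed nodes end at end of input


def to_html(node):
    """One possible back end: walk the tree and emit HTML."""
    if isinstance(node, Text):
        return escape(node.value, quote=False)
    inner = "".join(to_html(child) for child in node.children)
    tag = {"bold": "b", "italic": "i"}.get(node.kind)
    return f"<{tag}>{inner}</{tag}>" if tag else inner


def to_plain(node):
    """A second back end over the same tree: plain text only."""
    if isinstance(node, Text):
        return node.value
    return "".join(to_plain(child) for child in node.children)


tree = parse("plain '''bold''' and ''it''")
print(to_html(tree))
print(to_plain(tree))
```

A WYSIWYG editor would be a third consumer of the same tree, which is the extensibility a regex-to-XHTML pipeline cannot offer.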
So our problem is the dilemma: either we standardise, and lose backwards compatibility, or we don't, and lose extensibility. And in the long run, I think the first option is better. However, in standardising the language we'd lose the feature of it that all syntax is valid (useful, as then people can't ever be presented with error messages on their pages) so ideally the
The "all syntax is valid" thing really arises because of the nature of browsers rather than because of the parser itself. I don't think we're in danger of losing that - the parser will just have to fail gracefully when it comes up against malformed wikitext.
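As a rough illustration of "fail gracefully", here is one possible recovery policy, sketched in Python for the same toy subset (only ''italic'' and '''bold'''). The function name `render` and the policy itself are assumptions, not a description of any existing parser: a marker that never finds its partner is put back as literal text, so no input raises an error and no text is dropped.

```python
import re
from html import escape

# Toy subset: only the two quote markers are recognised (an assumption).
TAGS = {"'''": "b", "''": "i"}


def render(source):
    """Render bold/italic markers to HTML without ever raising on bad input."""
    out = []        # output fragments, joined at the end
    unclosed = []   # stack of (marker, index in `out` of its opening tag)
    for token in re.split(r"('''|'')", source):
        if not token:
            continue
        if token not in TAGS:
            out.append(escape(token, quote=False))
        elif unclosed and unclosed[-1][0] == token:
            unclosed.pop()
            out.append(f"</{TAGS[token]}>")
        else:
            unclosed.append((token, len(out)))
            out.append(f"<{TAGS[token]}>")
    # Recovery: anything still open never found its partner, so show the
    # literal marker rather than guessing where it should have closed.
    for marker, index in unclosed:
        out[index] = marker
    return "".join(out)


print(render("a '''b''' c"))
print(render("a '''b c"))
```

With this policy the malformed case stays visible on the page as stray quote marks, which an editor can spot and fix.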
On the point of immutable validity, it is perhaps less useful for all text to be valid than for there to be "invalid markup" error messages. Whilst the former ensures users can never really "go wrong", it is still true that bad markup will lead to results they very much didn't intend - and it seems to me more useful to give them an error message than a wildly unintended result.
Wildly unintended is fine; at least they see that (or someone else does). What's more dangerous is when stuff silently breaks, making a sentence or two just disappear off the page.
Steve