-----Original Message-----
From: Ivan Krstic [mailto:krstic@solarsail.hcs.harvard.edu]
Sent: Wednesday, August 16, 2006 11:21 PM
To: Wikimedia developers
Cc: Eric Astor
Subject: Re: [Wikitech-l] Dajoo: a Java-based offline editor/viewer
Simetrical wrote:
> Out of curiosity, why not? What bits of markup screw over the project?

A better question is which bits don't ;)

Eric?
*sighs* Well, that's my cue. My sincere apologies in advance if this starts to ramble and lose coherence - I'm a bit tired today, and I've been focusing mostly on other things since the Wikimania Hacking Days.
Let's see here. Please consider this an incomplete, unreliable list, meant solely as an indication of the basic problems encountered when attempting to formalize MediaWiki's wikitext... And I'm no expert on parsing, except in that I've spent a large part of the summer constructing parsers for essentially unparseable languages. Basic point, though, is that MediaWiki wikitext is INCREDIBLY context-sensitive.
Single case that shows something interesting: '''hi''hello'''hi'''hello''hi'''
Try running it through MediaWiki, and what do you get? <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
In other words, you've discovered that the current syntax supports improper nesting of markup, in a fashion all its own. I don't know of any way to duplicate this in any meaningfully formal system, although I believe a multiple-pass parser *might* be capable of handling it. In fact, some sort of multiple-pass parser (the MediaWiki parser) obviously can.
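To make that concrete, here's a toy sketch in Python - emphatically *not* MediaWiki's actual quote-handling code, which is considerably hairier - of a stateful pass that treats ''' and '' as bold/italic toggles and closes and reopens tags as needed to keep the emitted HTML well-formed. On the example above it reproduces MediaWiki's output exactly:

```python
def render_quotes(text):
    """Toy renderer: treat ''' as a bold toggle and '' as an italic
    toggle, closing and reopening tags as needed so the emitted HTML
    stays well-formed even when the wikitext nesting is improper."""
    out = []
    open_tags = []  # currently open tags, outermost first
    i = 0
    while i < len(text):
        if text.startswith("'''", i):
            tag, i = 'b', i + 3
        elif text.startswith("''", i):
            tag, i = 'i', i + 2
        else:
            out.append(text[i])
            i += 1
            continue
        if tag in open_tags:
            # Toggling off: close inner tags down to `tag`, then
            # reopen the ones that were closed along the way.
            reopen = []
            while open_tags:
                t = open_tags.pop()
                out.append('</%s>' % t)
                if t == tag:
                    break
                reopen.append(t)
            for t in reversed(reopen):
                out.append('<%s>' % t)
                open_tags.append(t)
        else:
            out.append('<%s>' % tag)
            open_tags.append(tag)
    while open_tags:
        out.append('</%s>' % open_tags.pop())
    return ''.join(out)

print(render_quotes("'''hi''hello'''hi'''hello''hi'''"))
# -> <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
```

Note that this only works because it's a stateful scan over the whole input, not a derivation from a grammar - which is the whole point.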
Also, templates need to be transcluded before most of the parsing can take place, since in the current system, the text may leave some syntactically-significant constructs incomplete, finishing them in the transclusion stage...
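As a toy illustration of why transclusion has to come first (hypothetical template names, and nothing like MediaWiki's real preprocessor): a template body can complete a construct that the raw page text leaves open, so no markup parsing is possible until substitution is done.

```python
import re

# Hypothetical template store; MediaWiki would fetch these from the wiki.
TEMPLATES = {
    'start-bold': "'''",
    'end-bold': "'''",
}

def transclude(text):
    """Replace {{Name}} with the template body, leaving unknown names
    untouched. Markup parsing can only happen after this step."""
    return re.sub(r'\{\{([\w-]+)\}\}',
                  lambda m: TEMPLATES.get(m.group(1), m.group(0)),
                  text)

# The raw page text contains no complete bold construct; only the
# transcluded result does:
print(transclude("{{start-bold}}important{{end-bold}}"))
# -> '''important'''
```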
Beyond that... *pulls up his mostly-aborted attempt at a parser*
Indefinite lookahead is required in some places, particularly in headings and magic variables - and for almost any other multi-part markup, if we want to do what the current parser does (ignore incomplete markup, treating it as if it had been properly escaped). This even holds true for bold and italics, since you need indefinite lookahead to be able to tell whether the first three quotes in '''this'' should be parsed as ''', <i>', or <b>. The situation gets even worse when you try to allow for improper nesting.
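A sketch of that three-way decision (following one plausible disambiguation convention, not MediaWiki's exact heuristics): classifying a leading ''' means scanning arbitrarily far ahead for whatever closes it.

```python
def read_three_quotes(text):
    """Classify a leading ''' by scanning the rest of the input for a
    closer - the indefinite lookahead described above. The apostrophe-
    plus-italic case follows one plausible convention; MediaWiki's real
    rules are more involved."""
    assert text.startswith("'''")
    rest = text[3:]
    if "'''" in rest:
        return '<b>'    # a matching triple follows: open bold
    if "''" in rest:
        return "'<i>"   # only a double follows: literal ' then open italic
    return "'''"        # nothing closes it: leave the quotes as literal

print(read_three_quotes("'''this'''"))  # -> <b>
print(read_three_quotes("'''this''"))   # -> '<i>
print(read_three_quotes("'''this"))     # -> '''
```

No fixed amount of lookahead suffices here: the closer can be any distance away, which is exactly what pushes this outside LL(k)/LR(k) territory.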
Other places require fixed, but large, amounts of lookahead... freelinks require at least 9 characters, for example. Technically, I'll admit that a GLR parser (or a backtracking framework) could manage even the indefinite lookahead that I mentioned... but it's still problematic, since the grammar is left ambiguous in certain cases.
Oh, right - and we'd need to special-case every tag-style piece of markup, including every allowed HTML tag, since formal grammars generally can't reference previously-matched text. This also applies to the heading levels - we'd need separate ad-hoc constructs for each level of heading we wanted to support, duplicating a lot of the grammar between each one.
There are other complications as well - again, this list should be considered both incomplete and, possibly, inconsistent with reality.
In summary, for most definitions of formal, it is impossible to write formal grammars for most significant subsets of current MediaWiki syntax. I had significant success with a regex-based grammar specification (using Martel), backed by a VERY general backend capable of back-tracking and other clever tricks (mxTextTools) - but the recursive structure is virtually impossible to handle in a regex-based framework.
- Eric Astor
P.S. As indicated above, I honestly feel that the difficulties aren't insurmountable - if you're willing to build an appropriate parsing framework, which will be semi-formal at best.
P.P.S. When possible, in my *copious* free time (</sarcasm>), I'm hoping to take another frontend to mxTextTools (SimpleParse, to be specific), modify it sufficiently to support all the necessary features, and then build something capable of parsing the current MediaWiki syntax (although I might have to drop support for improper nesting). I've no idea if or when this might happen, but I'm considering it a long-term goal if the current situation doesn't improve.