On 8/17/06, Eric Astor <eastor1(a)swarthmore.edu> wrote:
Single case that shows something interesting:
'''hi''hello'''hi'''hello''hi'''
Try running it through MediaWiki, and what do you get?
<b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
That's awesome :)
In other words, you've discovered that the current syntax supports improper
nesting of markup, in a rather unique fashion. I don't know of any way to
duplicate this in any significantly formal system, although I believe a
multiple-pass parser *might* be capable of handling it. In fact, some sort
of multiple-pass parser (the MediaWiki parser) obviously can.
Is this not the sort of "backwards compatibility" that we could safely
do without? Does anyone intentionally use that kind of construct?
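
For anyone who wants to poke at this, here is a rough sketch of one way to get that output with a plain toggle-and-rebalance pass. It is not the MediaWiki parser's actual algorithm. The function name render_quotes is made up for illustration, and it only understands runs of exactly two or three apostrophes.

```python
import re

# Marker -> HTML tag name. Only '' and ''' are handled in this sketch.
TAGS = {"'''": "b", "''": "i"}

def render_quotes(text):
    """Toggle bold/italic on each marker, re-balancing overlapping tags."""
    out = []
    open_tags = []  # currently open tags, outermost first
    # The capturing group keeps the markers in the split result.
    for piece in re.split(r"('''|'')", text):
        tag = TAGS.get(piece)
        if tag is None:
            out.append(piece)              # ordinary text
        elif tag not in open_tags:
            open_tags.append(tag)          # opening marker
            out.append(f"<{tag}>")
        else:
            # Closing marker: close any tags opened after this one,
            # close this one, then reopen the others in the same order.
            idx = open_tags.index(tag)
            inner = open_tags[idx + 1:]
            for t in reversed(inner):
                out.append(f"</{t}>")
            out.append(f"</{tag}>")
            for t in inner:
                out.append(f"<{t}>")
            open_tags = open_tags[:idx] + inner
    # Close whatever is still open at the end of the text.
    for t in reversed(open_tags):
        out.append(f"</{t}>")
    return "".join(out)

print(render_quotes("'''hi''hello'''hi'''hello''hi'''"))
```

The improperly nested input turns into properly nested HTML because every close forces the inner tags shut and reopens them, which is more bookkeeping than a plain grammar rule can express.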
Also, templates need to be transcluded before most of the parsing can take
place, since in the current system, the text may leave some
syntactically-significant constructs incomplete, finishing them in the
transclusion stage...
That's sort of a given, isn't it? What's the downside of doing
transclusion first?
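
A toy illustration of the "incomplete construct" case, in case it helps the discussion. Everything here is hypothetical: the template name end-bold, the TEMPLATES table, and the transclude helper are invented for the example and ignore parameters, nesting, and everything else real templates do.

```python
import re

# Hypothetical template table: the template body supplies the closing ''' marker.
TEMPLATES = {"end-bold": "'''"}

def transclude(text, templates):
    """Replace each {{name}} with its body; leave unknown templates untouched."""
    def expand(match):
        name = match.group(1).strip()
        return templates.get(name, match.group(0))
    return re.sub(r"\{\{([^{}|]+)\}\}", expand, text)

source = "'''important{{end-bold}} and plain"
expanded = transclude(source, TEMPLATES)

# Before transclusion the bold marker has no partner; afterwards it does.
print(source.count("'''"), expanded.count("'''"))
print(expanded)
```

Parsing the raw text first would see an unterminated bold run, so in this sketch the markup can only be interpreted sensibly after expansion.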
if it had been properly escaped). This even holds true for bold and italics,
since you need indefinite lookahead to be able to tell whether the first
three quotes in '''this'' should be parsed as ''', <i>', or <b>. The
situation gets even worse when you try to allow for improper nesting.
Personally I find the rules for multiple apostrophes very strange and
unpredictable - and hence worth changing. I was really surprised when
I sat down one day to test what happens when you stack one, two,
three...ten apostrophes. Not what I expected at all. No takers to
replace ''' with // or something?
Other places require fixed, but large, amounts of lookahead... freelinks
require at least 9 characters, for example. Technically, I'll admit that a
What's a freelink?
GLR parser (or a backtracking framework) could manage even the indefinite
lookahead that I mentioned... but it's still problematic, since the grammar
is left ambiguous in certain cases.
Oh, right - and we'd need to special-case every tag-style piece of markup,
including every allowed HTML tag, since formal grammars generally can't
reference previously-matched text. This also applies to the heading levels -
we'd need separate ad-hoc constructs for each level of heading we wanted to
support, duplicating a lot of the grammar between each one.
I don't understand, can you give an example?
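
One way to picture the heading-level point, as a sketch only. A regex engine with backreferences can say "close with the same run of = signs that opened the heading" in a single rule, while a plain context-free grammar has to spell out one production per level. The rule names (heading1, heading2, ...) and both helper functions are invented for illustration and are not taken from any actual MediaWiki grammar.

```python
import re

# \1 is a backreference: the closing run of '=' must equal the opening run.
HEADING = re.compile(r"^(={1,6})\s*(.+?)\s*\1$")

def parse_heading(line):
    """Return (level, title) for a heading line, or None for anything else."""
    m = HEADING.match(line)
    if m is None:
        return None
    return len(m.group(1)), m.group(2)

def heading_rules(max_level=6):
    """Without backreferences, each heading level needs its own grammar rule."""
    rules = []
    for n in range(1, max_level + 1):
        marks = "=" * n
        rules.append(f'heading{n} ::= "{marks}" text "{marks}"')
    return rules

print(parse_heading("== Title =="))
print("\n".join(heading_rules(3)))
```

The same duplication would apply to tag-style markup, where the closing tag name has to repeat whatever the opening tag matched.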
P.S. As indicated above, I honestly feel that the difficulties aren't
insurmountable - if you're willing to build an appropriate parsing
framework, which will be semi-formal at best.
What would such a thing look like, formal BNF rules mixed in with text
like "Actually if FOO is "boo" then special case Z is invoked..."?
Steve