Formalizing Wikitext (was: Dajoo) - Wikitech-l

17 Aug 2006


      On Thu, Aug 17, 2006 at 12:21:13AM -0400, Eric Astor wrote:
...
Let's see here. Please consider this an incomplete, unreliable list, meant
solely as an indication of the basic problems encountered when attempting to
formalize MediaWiki's wikitext... And I'm no expert on parsing, except in
that I've spent a large part of the summer constructing parsers for
essentially unparseable languages. Basic point, though, is that MediaWiki
wikitext is INCREDIBLY context-sensitive.
Single case that shows something interesting:
'''hi''hello'''hi'''hello''hi'''
Try running it through MediaWiki, and what do you get?
<b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
In other words, you've discovered that the current syntax supports improper
nesting of markup, in a rather unique fashion. I don't know of any way to
duplicate this in any significantly formal system, although I believe a
multiple-pass parser *might* be capable of handling it. In fact, some sort
of multiple-pass parser (the MediaWiki parser) obviously can.
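
A minimal sketch of one way to get that output, for illustration only. It is not MediaWiki's actual algorithm; the names render_quotes and toggle are made up here, and it assumes apostrophes only ever appear in runs of two (italic) or three (bold). It keeps a stack of open tags, and when a marker closes a tag that is not on top of the stack, it closes everything above that tag and reopens it afterwards:

```python
import re

# Assumption: '' toggles italic and ''' toggles bold; longer runs are ignored here.
TAGS = {"''": "i", "'''": "b"}

def toggle(tag, open_tags, out):
    """Open the tag if it is closed; otherwise close it, repairing the nesting."""
    if tag not in open_tags:
        open_tags.append(tag)
        out.append(f"<{tag}>")
        return
    reopened = []
    while True:
        top = open_tags.pop()
        out.append(f"</{top}>")
        if top == tag:
            break
        reopened.append(top)  # closed only to keep the output well-formed
    for t in reversed(reopened):
        open_tags.append(t)
        out.append(f"<{t}>")

def render_quotes(line):
    out, open_tags = [], []
    # The capture group keeps the apostrophe runs at the odd indices of the split.
    for index, piece in enumerate(re.split(r"('{3}|'{2})", line)):
        if index % 2 == 1:
            toggle(TAGS[piece], open_tags, out)
        else:
            out.append(piece)
    while open_tags:  # close whatever is still open at end of line
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)

print(render_quotes("'''hi''hello'''hi'''hello''hi'''"))
# <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
```

The stack is what takes this outside a plain regular grammar: the meaning of each marker depends on everything still open when it is seen.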
I suspect that the "proper" parsing of that particular combination is
undefined, and therefore you can do anything you like.
That's one of the points I was suggesting.
...
Also, templates need to be transcluded before most of the parsing can take
place, since in the current system, the text may leave some
syntactically-significant constructs incomplete, finishing them in the
transclusion stage...
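
A small sketch of why the ordering matters, under stated assumptions: the template name "start-bold", its body, and the helper functions are all hypothetical, and expansion here is a single regex pass with no parameters or nested templates. The page text on its own contains an unmatched bold marker; the matching one only appears once the template has been transcluded:

```python
import re

# Hypothetical template store; on a real wiki, templates are pages of their own.
TEMPLATES = {"start-bold": "'''Warning:"}

def expand_templates(text, templates):
    """Replace each {{name}} with its body; unknown names are left untouched."""
    def substitute(match):
        name = match.group(1).strip()
        return templates.get(name, match.group(0))
    return re.sub(r"\{\{([^{}|]+)\}\}", substitute, text)

def count_bold_markers(text):
    """Count ''' runs; an odd count means a bold span is left open."""
    return len(re.findall(r"'{3}", text))

page = "{{start-bold}} do not edit''' this page."
expanded = expand_templates(page, TEMPLATES)

print(count_bold_markers(page))      # markers visible before transclusion
print(count_bold_markers(expanded))  # markers visible after transclusion
```

A parser that runs before expansion sees only the closing marker and has nothing to pair it with.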
And of course, there's extensions, but I gather they're responsible for
calling the parser themselves, which seemed to make sense.
...
In summary, for most definitions of formal, it is impossible to write formal
grammars for most significant subsets of current MediaWiki syntax. I had
significant success with a regex-based grammar specification (using Martel),
backed by a VERY general backend capable of back-tracking and other clever
tricks (mxTextTools) - but the recursive structure is virtually impossible
to handle in a regex-based framework.
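
To illustrate the recursion problem, here is a rough sketch using nested templates as the example (that choice is an assumption; the message does not say which recursive constructs were the hard ones). Python's standard re module has no recursive patterns, so a pattern that forbids braces inside the match finds only the innermost template, while a hand-written depth counter finds the whole balanced span. The function names are invented for this example, and triple-brace parameters are ignored:

```python
import re

def innermost_templates(text):
    """Regex approach: matches only {{...}} spans that contain no braces."""
    return re.findall(r"\{\{[^{}]*\}\}", text)

def outer_template_span(text, start=0):
    """Return (begin, end) of the first balanced {{...}} at or after start, or None."""
    begin = text.find("{{", start)
    if begin == -1:
        return None
    depth = 0
    i = begin
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return begin, i
        else:
            i += 1
    return None  # the template was never closed

text = "x {{outer|{{inner}}|tail}} y"
print(innermost_templates(text))
begin, end = outer_template_span(text)
print(text[begin:end])
```

The depth counter is the part a purely regex-based specification has no way to express; it is the simplest form of the stack a recursive parser would keep.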

Eric Astor

P.S. As indicated above, I honestly feel that the difficulties aren't
insurmountable - if you're willing to build an appropriate parsing
framework, which will be semi-formal at best.
P.P.S. When possible, in my *copious* free time (</sarcasm>), I'm hoping to
take another frontend to mxTextTools (SimpleParse, to be specific), modify
it sufficiently to support all the necessary features, and then build
something capable of parsing the current MediaWiki syntax (although I might
have to drop support for improper nesting). I've no idea if or when this
might happen, but I'm considering it a long-term goal if the current
situation doesn't improve.
I don't know that I think that the spec has to be something you can
feed to Bison, certainly.  But it has to be unambiguously parseable,
with as many corner cases defined as you can manage, at least by
humans, before it's worth trying anything more complicated.
And it's going to *have* to be done sooner or later.  I haven't ever
even looked at the parser code, and just from people talking about it, I
can tell that there will come a time when it's just too tense to work
on anymore.
Hopefully it will get replaced before then.
On Wed, Aug 16, 2006 at 11:26:22PM -0400, Ivan Krstić wrote:
...
Jay R. Ashworth wrote:
...
I don't know how useful it will be to have wikitext specified strictly,
and I don't think we'll be able to tell until we see how far off we
are, and what might need to be tweaked.
This was discussed at hacking days. Brion's pronouncement is that the
current syntax will admit essentially no backwards-incompatible changes.
My point was more based on taking advantage of the
implementation-defined and -dependent portions of the current 'spec';
things like specifying binding and precedence rules concerning things
like Eric's first example, above.
It's unfortunate that formalization went on the table so late, but it
gets done for a reason, and, being an outgrowth of an engineering
construct, if you need it, and you don't do it, then you Just Can't do
whatever it was that made you decide you needed it.
Wasn't someone from SoC working on this?
Did we ever get a final status report from the SoC work?  (It's done
now, isn't it?)
And let's be quite clear: *brion* (and Tim) will admit no
backwards-incompatible changes, not the syntax.  The syntax is an
inanimate non-object.
(I'm not trying to be combative, there, just honest.)
Cheers,
-- jra
-- 
Jay R. Ashworth                                                jra@baylink.com
Designer                          Baylink                             RFC 2100
Ashworth & Associates        The Things I Think                        '87 e24
St Petersburg FL USA      http://baylink.pitas.com             +1 727 647 1274

    The Internet: We paved paradise, and put up a snarking lot.