On Thu, Aug 17, 2006 at 12:21:13AM -0400, Eric Astor wrote:
Let's see here. Please consider this an incomplete, unreliable list, meant solely as an indication of the basic problems encountered when attempting to formalize MediaWiki's wikitext... And I'm no expert on parsing, except in that I've spent a large part of the summer constructing parsers for essentially unparseable languages. Basic point, though, is that MediaWiki wikitext is INCREDIBLY context-sensitive.
Single case that shows something interesting: '''hi''hello'''hi'''hello''hi'''
Try running it through MediaWiki, and what do you get? <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
In other words, you've discovered that the current syntax supports improper nesting of markup, in a rather unique fashion. I don't know of any way to duplicate this in any significantly formal system, although I believe a multiple-pass parser *might* be capable of handling it. In fact, some sort of multiple-pass parser (the MediaWiki parser) obviously can.
I suspect that the "proper" parsing of that particular combination is undefined, and therefore you can do anything you like.
That's one of the points I was suggesting.
Also, templates need to be transcluded before most of the parsing can take place, since in the current system, the text may leave some syntactically-significant constructs incomplete, finishing them in the transclusion stage...
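To illustrate the ordering problem with a toy case: the sketch below assumes a hypothetical template named open-bold whose body is just an opening bold marker. The template name, the helper names, and the simplified {{name}} matching are all made up for illustration and are not MediaWiki's actual expansion logic. Counting bold markers before and after expansion shows why the markup cannot be judged complete until transclusion has happened.

```python
import re

# Hypothetical template store: "open-bold" expands to an opening bold marker.
TEMPLATES = {"open-bold": "'''"}


def expand_templates(text, templates=TEMPLATES):
    """Replace each simple {{name}} with its body; unknown names are left as-is."""
    def substitute(match):
        name = match.group(1).strip()
        return templates.get(name, match.group(0))
    return re.sub(r"\{\{([^{}|]+)\}\}", substitute, text)


def count_bold_markers(text):
    """Count non-overlapping occurrences of the ''' bold marker."""
    return len(re.findall(r"'''", text))


source = "{{open-bold}}hi'''"
expanded = expand_templates(source)

# Before expansion there is one lone marker; after expansion there is a pair.
print(count_bold_markers(source), count_bold_markers(expanded))
```

A parser that tokenized the unexpanded source would see a single unmatched bold marker, while the expanded text contains a matched pair.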
And of course, there's extensions, but I gather they're responsible for calling the parser themselves, which seemed to make sense.
In summary, for most definitions of formal, it is impossible to write formal grammars for most significant subsets of current MediaWiki syntax. I had significant success with a regex-based grammar specification (using Martel), backed by a VERY general backend capable of back-tracking and other clever tricks (mxTextTools) - but the recursive structure is virtually impossible to handle in a regex-based framework.
- Eric Astor
P.S. As indicated above, I honestly feel that the difficulties aren't insurmountable - if you're willing to build an appropriate parsing framework, which will be semi-formal at best.
P.P.S. When possible, in my *copious* free time (</sarcasm>), I'm hoping to take another frontend to mxTextTools (SimpleParse, to be specific), modify it sufficiently to support all the necessary features, and then build something capable of parsing the current MediaWiki syntax (although I might have to drop support for improper nesting). I've no idea if or when this might happen, but I'm considering it a long-term goal if the current situation doesn't improve.
I don't know that I think that the spec has to be something you can feed to Bison, certainly. But it has to be unambiguously parseable, with as many corner cases defined as you can manage, at least by humans, before it's worth trying anything more complicated.
And it's going to *have* to be done sooner or later. I haven't ever even looked at the parser code, and just from people talking about it, I can tell that there will come a time when it's just too tense to work on anymore.
Hopefully it will get replaced before then.
On Wed, Aug 16, 2006 at 11:26:22PM -0400, Ivan Krstić wrote:
Jay R. Ashworth wrote:
I don't know how useful it will be to have wikitext specified strictly, and I don't think we'll be able to tell until we see how far off we are, and what might need to be tweaked.
This was discussed at hacking days. Brion's pronouncement is that the current syntax will admit essentially no backwards-incompatible changes.
My point was more based on taking advantage of the implementation-defined and -dependent portions of the current 'spec'; things like specifying binding and precedence rules for cases like Eric's first example, above.
It's unfortunate that formalization went on the table so late, but it gets done for a reason, and, being an outgrowth of an engineering construct, if you need it and you don't do it, then you Just Can't do whatever it was that made you decide you needed it.
Wasn't someone from SoC working on this?
Did we ever get a final status report from the SoC work? (It's done now, isn't it?)
And let's be quite clear: *brion* (and Tim) will admit no backwards-incompatible changes, not the syntax. The syntax is an inanimate non-object.
(I'm not trying to be combative, there, just honest.)
Cheers, -- jra
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
On Thu, Aug 17, 2006 at 12:21:13AM -0400, Eric Astor wrote:
Single case that shows something interesting: '''hi''hello'''hi'''hello''hi'''
Try running it through MediaWiki, and what do you get? <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
I suspect that the "proper" parsing of that particular combination is undefined, and therefore you can do anything you like.
Actually, the proper parsing of that particular combination is defined perfectly well. The problem is that HTML doesn't allow overlapping tags. If it did allow them, you would get a straight substitution:
'''hi''hello'''hi'''hello''hi''' -> <b>hi<i>hello</b>hi<b>hello</i>hi</b>
Each ''' becomes </?b>, each '' becomes </?i>. That's definitely defined behavior. But to make it valid HTML, you need to close the <i> before the <b>, then reopen it after the <b>, and so on. So this isn't something that can be worked around.
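The close-and-reopen step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MediaWiki parser: the function name quotes_to_html is made up, runs of four or five apostrophes get no special handling, and anything still open at the end of the input is simply closed.

```python
import re

# Map each quote marker to the HTML tag it toggles.
TAGS = {"'''": "b", "''": "i"}


def quotes_to_html(text):
    """Toggle <b>/<i> on quote markers, keeping the emitted HTML properly nested."""
    out = []
    open_tags = []  # stack of currently open tags, innermost last
    # The capturing group keeps the markers in the split result;
    # ''' is listed first so it wins over ''.
    for token in re.split(r"('''|'')", text):
        tag = TAGS.get(token)
        if tag is None:
            out.append(token)  # ordinary text (possibly empty)
        elif tag not in open_tags:
            open_tags.append(tag)
            out.append(f"<{tag}>")
        else:
            # Close everything down to and including the toggled tag...
            reopened = []
            while True:
                top = open_tags.pop()
                out.append(f"</{top}>")
                if top == tag:
                    break
                reopened.append(top)
            # ...then reopen whatever had to be closed on the way.
            for inner in reversed(reopened):
                open_tags.append(inner)
                out.append(f"<{inner}>")
    # Assumption: tags left open at the end of the input are closed.
    while open_tags:
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)


print(quotes_to_html("'''hi''hello'''hi'''hello''hi'''"))
```

On Eric's example this prints <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>, the same output he reported from MediaWiki.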
Wasn't someone from SoC working on this?
Did we ever get a final status report from the SoC work? (It's done now, isn't it?)
There were two SoC projects, one to have embeddable media and one to have a forum-like talk page instead of our current wiki thing. The former I don't know what happened to, the latter we have prototype code and a largely completed design for (and I believe the author of that project has agreed to try to see it through). The deadline is in a few days, I think, unless it's already past.
On Thu, Aug 17, 2006 at 02:03:07AM -0400, Simetrical wrote:
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
On Thu, Aug 17, 2006 at 12:21:13AM -0400, Eric Astor wrote:
Single case that shows something interesting: '''hi''hello'''hi'''hello''hi'''
Try running it through MediaWiki, and what do you get? <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
I suspect that the "proper" parsing of that particular combination is undefined, and therefore you can do anything you like.
Actually, the proper parsing of that particular combination is defined perfectly well. The problem is that HTML doesn't allow overlapping tags. If it did allow them, you would get a straight substitution:
'''hi''hello'''hi'''hello''hi''' -> <b>hi<i>hello</b>hi<b>hello</i>hi</b>
Each ''' becomes </?b>, each '' becomes </?i>. That's definitely defined behavior. But to make it valid HTML, you need to close the <i> before the <b>, then reopen it after the <b>, and so on. So this isn't something that can be worked around.
Sorry; I missed that he had not included the canonical example of this problem, which is that if you get ''''', there can be constructions wherein it's not possible to determine whether you're ending a bold and starting an italic, or the reverse, without context.
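A small sketch of why that case needs context, assuming a parser that tracks the currently open tags as a stack (innermost last). The function name resolve_five_quotes is hypothetical, and the rule shown is one plausible reading, not what MediaWiki actually does: if something is open, the innermost open tag is handled first; if nothing is open, the left context alone cannot decide the order.

```python
def resolve_five_quotes(open_tags):
    """Decide in which order a run of five apostrophes toggles 'b' and 'i'.

    open_tags is the stack of currently open tags, innermost last.
    Returns a (first, second) tuple, or None when the left context alone
    is not enough and the parser would have to look ahead.
    """
    if not open_tags:
        # Both tags are being opened; which one should be outermost
        # depends on which closes first later in the text.
        return None
    first = open_tags[-1]  # the innermost open tag is toggled first
    second = "i" if first == "b" else "b"
    return (first, second)


# Bold is open: end the bold, then start an italic.
print(resolve_five_quotes(["b"]))
# Italic is open: end the italic, then start a bold.
print(resolve_five_quotes(["i"]))
# Nothing is open: undecidable without lookahead.
print(resolve_five_quotes([]))
```

The same five characters come out as bold-then-italic in one context and italic-then-bold in the other, which is the context dependence in question.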
Wasn't someone from SoC working on this?
Did we ever get a final status report from the SoC work? (It's done now, isn't it?)
There were two SoC projects, one to have embeddable media and one to have a forum-like talk page instead of our current wiki thing. The former I don't know what happened to, the latter we have prototype code and a largely completed design for (and I believe the author of that project has agreed to try to see it through). The deadline is in a few days, I think, unless it's already past.
Ah. I had missed the first one, and am interested to see how the second one works out.
Cheers, -- jra
wikitech-l@lists.wikimedia.org