On 11/7/07, Steve Bennett stevagewp@gmail.com wrote:
The stuff the BNF grammar doesn't cover is tacked on with some other methods. In practice, it seems like a two-pass parser would be ideal: one recursive pass to deal with templates and other substitution-type things, then a second pass with the actual grammar of most of the language. The first pass is of necessity recursive, so there's probably no point in having it spend the time to repeatedly parse italics or whatever, when it's just going to have to do it again when it substitutes stuff in. Further rendering passes are going to be needed, e.g., to insert the table of contents. Further parsing passes may or may not be needed.
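For what it's worth, the recursive substitution pass described above can be sketched in a few lines (a toy Python sketch; the template store, the {{{1}}} handling, and the regex are all illustrative assumptions, nothing like the real preprocessor):

```python
import re

# Toy template store; a real first pass would pull these from the database.
TEMPLATES = {"hello": "Hello, {{{1}}}!"}

def expand(text, depth=0):
    # In-place, pattern-based expansion of {{name|arg}}: innermost matches
    # first (the pattern forbids nested braces), repeated until nothing
    # changes, with a depth cap against template loops.
    if depth > 10:
        return text
    def repl(m):
        name, _, arg = m.group(1).partition("|")
        return TEMPLATES.get(name.strip(), "").replace("{{{1}}}", arg)
    new = re.sub(r"\{\{([^{}]+)\}\}", repl, text)
    return text if new == text else expand(new, depth + 1)
```

Here expand("{{hello|world}}") gives "Hello, world!"; the real pass would also have to deal with parser functions, multiple parameters, and noinclude/includeonly.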
Ouch, now you're up to about 4 passes, which isn't far off the current version. Two passes would be good, like a C compiler: once for meta-markup (templates, parser functions), and once for content. Would it be possible to perhaps have an in-place pattern-based parser for the first phase, then a proper recursive descent for the content?
I don't know why it couldn't be two-pass. You obviously need some extra steps to insert things like the TOC, but those don't really count as a "pass". It would just require the parser to, say, parse all the stuff up to the TOC location as one string, then start a new string for stuff after the TOC should go, and concatenate them all together when it's finished actually parsing.
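The concatenation trick can be sketched like this (a toy Python sketch, not MediaWiki code; the placeholder token and the heading format are assumptions):

```python
TOC_MARKER = "\x07TOC\x07"  # hypothetical placeholder token

def render(wikitext):
    # Parse straight through, drop a placeholder where the TOC belongs,
    # and splice the real TOC in at the end: a fix-up step, not a full
    # extra parsing pass.
    headings = []
    out = []
    for line in wikitext.splitlines():
        if line.startswith("== ") and line.endswith(" =="):
            title = line.strip("= ").strip()
            headings.append(title)
            if len(headings) == 1:
                out.append(TOC_MARKER)  # TOC goes above the first heading
            out.append(f"<h2>{title}</h2>")
        else:
            out.append(f"<p>{line}</p>" if line else "")
    toc = "<ul>" + "".join(f"<li>{t}</li>" for t in headings) + "</ul>"
    return "\n".join(out).replace(TOC_MARKER, toc, 1)
```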
I suspect a major problem might arise if there are constructs that require more than one token of lookahead. There probably are (a run of apostrophes, for instance, can't be classified as bold or italic until the matching closing run is found), and apparently Bison and similar tools can't handle those, at least not in their default LALR(1) mode. But again, I would defer on this to someone who actually knows something. :) Constructs like that might be good candidates for removal in any grammar overhaul.
I guess there's no possibility of making wholesale changes to the grammar then implementing a migration script?
Honestly, that's what I'd like. I've suggested it before. My suggestion was to use XML, and make up for the difficulty of direct editing with good WYSIWYG. A one-way conversion to XML should be relatively easy to make lossless, since we could just adapt the existing Parser with all its quirks. Writing a WYSIWYG editor good enough to effectively replace wikitext editing would be trickier, but doable given the manpower. Of course it wouldn't work on non-JS-supporting clients, but XML is human-readable, right? :P Converting from XML back to wikitext losslessly is quite likely impossible, so editing in wikitext would no longer be an option. That, of course, met with substantial opposition.
Anyway, as far as I know, no one has persevered with this task long enough to present a list of recommended changes that would make the grammar easier to parse. Although maybe I just haven't seen it. Brion has indicated in the past that reasonable changes to accommodate a formal grammar would be okay.
Wasn't there a move to get away from PHP for the parser? Is that not feasible?
Most people who run MediaWiki do so on shared hosting, and don't have the ability to install PHP modules or other non-PHP code. So some kind of PHP-based solution really needs to exist.
I have trouble picturing this. It could be horrendous. But if it could be managed so there were perhaps a few dozen complaints a day and not more, that might be doable.
I'm not worried. It's a wiki, it'll get fixed. As long as the effects don't impede the ability to read the text, which they probably won't, nothing will break down and die.
The parser moves, though. I don't see a semi-formal grammar that isn't used for anything keeping pace.
The parser doesn't move very quickly. Anyone who fiddles with parser output without updating the parser tests already gets a tongue-lashing (as I know quite well); the same could be done for a parser description, if one actually existed. But as of now, a comprehensive description of the current parser is probably impossible. My favorite symptom of how ad hoc it is as a parser (I may have cited this here before) is this comment, at line 1204:
# If there is an odd number of both bold and italics, it is likely
# that one of the bold ones was meant to be an apostrophe followed
# by italics. Which one we cannot know for certain, but it is more
# likely to be one that has a single-letter word before it.
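That heuristic can be sketched in about a dozen lines (a Python sketch; the run-based tokenization and the function are mine, and the real logic in the PHP Parser's quote handling is considerably messier):

```python
import re

def classify_quote_runs(line):
    # Find runs of 2+ apostrophes: '' toggles italics, ''' toggles bold.
    runs = [(m.start(), len(m.group())) for m in re.finditer(r"'{2,}", line)]
    kinds = {p: ("bold" if n >= 3 else "italic") for p, n in runs}
    bolds = [p for p, n in runs if n >= 3]
    italics = [p for p, n in runs if n == 2]
    if len(bolds) % 2 and len(italics) % 2:
        # Odd counts of both: one ''' was probably an apostrophe + ''.
        # Prefer the run with a single-letter word right before it.
        def single_letter_before(p):
            return (p >= 1 and line[p - 1].isalpha()
                    and (p == 1 or not line[p - 2].isalpha()))
        pick = next((p for p in bolds if single_letter_before(p)), bolds[0])
        kinds[pick] = "apostrophe+italic"
    return sorted(kinds.items())
```

So in a line like "l'''Étoile'' shines ''brightly''", the run after the single-letter word "l" gets reinterpreted as an apostrophe followed by an italic toggle, and everything balances.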
On 11/8/07, Andrew Garrett andrew@epstone.net wrote:
Wouldn't the most sensible way to come up with a formal specification be to write a dirty great big page with all the details of the way certain constructs are parsed?
No, because 1) it's not possible to be sufficiently comprehensive for reliable interoperability, and 2) even if it were, the resulting explanation would be so complicated that probably no one would bother.
BNF/EBNF/Bison/etc. isn't really suited to the task at all: as Steve has mentioned, it's designed to answer "does text Y match grammar Z?" rather than "what output should be produced when text Y is parsed with grammar Z?"
Not BNF alone, but Bison sure is designed to create output of some kind. At least that's the very strong impression I got from reading the Bison manual: e.g., step 1 in using Bison is given as
"Formally specify the grammar in a form recognized by Bison (see section Bison Grammar Files). For each grammatical rule in the language, describe the action that is to be taken when an instance of that rule is recognized. The action is described by a sequence of C statements."
These C statements could, of course, include direct output, or preparation of some data structure for later use.
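The same rule-plus-action shape is easy to see even in a hand-rolled recursive descent (a Python sketch of inline bold/italic only; in real Bison the actions would be the C statements the manual describes, and here they are the nodes.append calls):

```python
def parse_inline(s, i=0, closer=None):
    # Each branch is a "rule"; each nodes.append is its "action",
    # building a tree for a later rendering pass instead of printing.
    # Assumes well-formed input (every '' and ''' is properly closed).
    nodes = []
    while i < len(s) and not (closer and s.startswith(closer, i)):
        if s.startswith("'''", i):
            inner, i = parse_inline(s, i + 3, "'''")
            nodes.append(("bold", inner))
        elif s.startswith("''", i):
            inner, i = parse_inline(s, i + 2, "''")
            nodes.append(("italic", inner))
        else:
            j = i + 1
            while j < len(s) and not s.startswith("''", j):
                j += 1
            nodes.append(("text", s[i:j]))
            i = j
    if closer is not None:
        return nodes, i + len(closer)
    return nodes
```

Running it on "a '''b ''c'' d''' e" yields a text/bold/italic tree rather than output, which is exactly the "data structure for later use" option.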
Anyway, maybe someone who's actually tried to start on something like this should chip in with their thoughts. timwi (with later work by Magnus) wrote something called flexbisonparse a couple of years ago:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/
The main Bison grammar/input file seems to be this:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/wikiparse.y?v...