On 11/7/07, Steve Bennett <stevagewp(a)gmail.com> wrote:
> > 3) The stuff the BNF grammar doesn't cover is tacked on with some
> > other methods.  In practice, it seems like a two-pass parser would
> > be ideal: one recursive pass to deal with templates and other
> > substitution-type things, then a second pass with the actual grammar
> > of most of the language.  The first pass is of necessity recursive,
> > so there's probably no point in having it spend the time to
> > repeatedly parse italics or whatever, when it's just going to have
> > to do it again when it substitutes stuff in.  Further rendering
> > passes are going to be needed, e.g., to insert the table of
> > contents.  Further parsing passes may or may not be needed.
>
> Ouch, now you're up to about 4 passes, which isn't far off the current
> version.  Two passes would be good, like a C compiler: once for
> meta-markup (templates, parser functions), and once for content.
> Would it be possible to perhaps have an in-place pattern-based parser
> for the first phase, then a proper recursive descent for the content?
I don't know why it couldn't be two-pass.  You obviously need some
extra steps to insert things like the TOC, but those don't really
count as a "pass".  It would just require the parser to, say, parse
all the stuff up to the TOC location as one string, then start a new
string for the stuff after where the TOC should go, and concatenate
them all together when it's finished actually parsing.
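That shape could be sketched like this (a toy Python illustration,
purely mine -- the template syntax, template store, and function names
are invented, not MediaWiki's):

```python
# A toy sketch, not MediaWiki code: pass 1 expands templates
# recursively, pass 2 parses markup, and the TOC is spliced in by
# concatenating parsed segments rather than by a separate full pass.
import re

TEMPLATES = {"hello": "''Hello''"}  # hypothetical template store

def expand_templates(text):
    # Pass 1: recursive substitution of {{name}} inclusions.
    def repl(m):
        return expand_templates(TEMPLATES.get(m.group(1), ""))
    return re.sub(r"\{\{(\w+)\}\}", repl, text)

def parse_markup(text):
    # Pass 2: the "real" grammar; italics only, for illustration.
    return re.sub(r"''(.+?)''", r"<i>\1</i>", text)

def render(text):
    expanded = expand_templates(text)
    before, sep, after = expanded.partition("__TOC__")
    toc = "<ul><li>(table of contents)</li></ul>" if sep else ""
    # Concatenate the parsed pieces around the generated TOC.
    return parse_markup(before) + toc + parse_markup(after)
```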
I suspect a major problem might arise if there are constructs that
require more than one token of lookahead.  There probably are, and
apparently Bison et al. can't parse those.  But again, I would defer
on this to someone who actually knows something.  :)  This kind of
construct might be a good candidate for removal in any kind of grammar
overhaul.
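As an invented example of the lookahead problem (mine, not from the
thread): a parser sitting on `[[` can't decide between a link and
literal brackets with any fixed amount of lookahead.

```python
# An invented example of a construct needing unbounded lookahead: at
# "[[" the parser cannot tell a link from literal brackets until it
# finds "]]" (or a newline/end of text first), arbitrarily far ahead.
def classify_open_bracket(text, pos):
    """Return 'link' if text[pos:] begins a closed [[...]] construct
    on one line, else 'literal'."""
    assert text[pos:pos + 2] == "[["
    end = text.find("]]", pos + 2)
    newline = text.find("\n", pos + 2)
    if end == -1 or (newline != -1 and newline < end):
        return "literal"  # unterminated on this line: plain text
    return "link"
```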
> I guess there's no possibility of making wholesale changes to the
> grammar then implementing a migration script?
Honestly, that's what I'd like.  I've suggested it in the past.  My
suggestion was to use XML, and make up for the difficulty of direct
editing with a good WYSIWYG editor.  A one-way conversion to XML
should be relatively easy to make lossless, since we can just adapt
the existing Parser with all its quirks.  Writing a WYSIWYG editor
good enough to effectively replace wikitext editing would be trickier,
but also doable given the manpower.  Of course it wouldn't work on
non-JS-supporting clients, but XML is human-readable, right?  :P
Converting from XML back to wikitext losslessly is quite likely
impossible, so editing in wikitext would no longer be an option.
That, of course, met with substantial opposition.
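The lossiness of the reverse direction can be seen with a deliberately
simplistic converter (my example, not a real proposal): once two
different source spellings normalize to the same XML, no converter can
recover which one the author wrote.

```python
# A deliberately simplistic converter showing why XML -> wikitext is
# ambiguous: two different wikitext spellings (apostrophe markup and
# literal HTML, which wikitext permits) map to the same XML.
def to_xml(wikitext):
    return wikitext.replace("''", "<i>", 1).replace("''", "</i>", 1)

assert to_xml("''word''") == "<i>word</i>"
# Literal HTML passes through unchanged, so the spellings collide:
assert to_xml("<i>word</i>") == "<i>word</i>"
```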
Anyway, no one seems to have persevered with this task for long enough
to present a list of recommended changes to make the grammar easier to
parse, as far as I know. Although maybe I just haven't seen it.
Brion has indicated in the past that reasonable changes to accommodate
a formal grammar would be okay.
> Wasn't there a move to get away from PHP for the parser?  Is that
> not feasible?
Most people who use MediaWiki use it on shared hosting, and do not
have the ability to install PHP modules or other non-PHP code. So
some kind of PHP-based solution really needs to exist.
> I have trouble picturing this.  It could be horrendous.  But if it
> could be managed so there were perhaps a few dozen complaints a day
> and not more, that might be doable.
I'm not worried. It's a wiki, it'll get fixed. As long as the
effects don't impede the ability to read the text, which they probably
won't, nothing will break down and die.
> The parser moves though.  I don't see a semi-formal grammar which
> isn't used for anything keeping pace.
The parser doesn't move very quickly. Anyone who fiddles with parser
output without updating the parser tests gets a tongue-lashing already
(as I know quite well); the same could be done for the parser
description if one actually existed. But as of now, a comprehensive
description of the current parser is probably impossible. My favorite
symptom of how ad hoc it is as a parser (I may have cited this here
before) is this comment, at line 1204:
# If there is an odd number of both bold and italics, it is likely
# that one of the bold ones was meant to be an apostrophe followed
# by italics. Which one we cannot know for certain, but it is more
# likely to be one that has a single-letter word before it.
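That heuristic could be sketched roughly like this (a Python
paraphrase of what the comment describes -- mine, not the actual
Parser.php code):

```python
import re

# A paraphrase (mine, not Parser.php) of the heuristic the comment
# describes: given odd counts of both ''' and '' runs, pick a ''' to
# reinterpret as an apostrophe followed by '', preferring one
# preceded by a single-letter word (as in "l'''Italia'' ...").
def pick_bold_to_demote(line):
    bolds = [m.start() for m in re.finditer(r"(?<!')'''(?!')", line)]
    italics = re.findall(r"(?<!')''(?!')", line)
    if len(bolds) % 2 == 0 or len(italics) % 2 == 0:
        return None  # no odd/odd ambiguity to resolve
    for pos in bolds:
        # A single-letter word immediately before the apostrophe run.
        if pos >= 1 and line[pos - 1].isalpha() and \
                (pos == 1 or not line[pos - 2].isalpha()):
            return pos
    return None
```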
On 11/8/07, Andrew Garrett <andrew(a)epstone.net> wrote:
> Wouldn't the most sensible way to come up with a formal
> specification be to write a dirty great big page with all the
> details of the way certain constructs are parsed?
No, because 1) it's not possible to be sufficiently comprehensive for
reliable interoperability, and 2) even if it were, the resulting
explanation would be so complicated that probably no one would bother.
> BNF/EBNF/Bison/etc isn't really suited to the task at all: as Steve
> has mentioned, it's designed to answer "Does text Y match grammar
> Z?", rather than "what output should be created when text Y is
> parsed with grammar Z?"
Not BNF alone, but Bison sure is designed to create output of some
kind. At least that's the very strong impression I got from reading
the Bison manual: e.g., step 1 in using Bison is given as
"Formally specify the grammar in a form recognized by Bison (see
section Bison Grammar Files). For each grammatical rule in the
language, describe the action that is to be taken when an instance of
that rule is recognized. The action is described by a sequence of C
statements."
These C statements could, of course, include direct output, or
preparation of some data structure for later use.
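As a loose Python analogy (not Bison, and not proposed MediaWiki
code): the essential idea is grammar rules paired with actions, so
that recognizing input also constructs output rather than just
answering yes or no.

```python
import re

# A loose analogy to Bison semantic actions: each rule carries an
# action that builds output (here an HTML string; it could equally
# build an AST node for later use) when the rule matches.
RULES = [
    (re.compile(r"'''(.+?)'''"), lambda m: "<b>%s</b>" % m.group(1)),
    (re.compile(r"''(.+?)''"), lambda m: "<i>%s</i>" % m.group(1)),
    (re.compile(r"\[\[(.+?)\]\]"),
     lambda m: '<a href="/wiki/%s">%s</a>' % (m.group(1), m.group(1))),
]

def apply_rules(text):
    # Each match fires its action, as a Bison reduction would.
    for pattern, action in RULES:
        text = pattern.sub(action, text)
    return text
```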
Anyway, maybe someone who's actually tried to start on something like
this should chip in with their thoughts. timwi (with later stuff by
Magnus) did something called flexbisonparse a couple of years ago:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/
The main Bison grammar/input file seems to be this:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/wikiparse.y?…