On 11/8/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
- Now that we have a grammar, a yacc parser is compiled, and
appropriate rendering bits are added to get it to render to HTML.
People have already done this, at least once, haven't they? Do we have a list of attempts?
3) The stuff the BNF grammar doesn't cover is tacked on with some
other methods. In practice, it seems like a two-pass parser would be ideal: one recursive pass to deal with templates and other substitution-type things, then a second pass with the actual grammar of most of the language. The first pass is of necessity recursive, so there's probably no point in having it spend the time to repeatedly parse italics or whatever, when it's just going to have to do it again when it substitutes stuff in. Further rendering passes are going to be needed, e.g., to insert the table of contents. Further parsing passes may or may not be needed.
Ouch, now you're up to about 4 passes, which isn't far off the current version. Two passes would be good, like a C compiler: one for meta-markup (templates, parser functions), and one for content. Would it be possible to have an in-place pattern-based parser for the first phase, then a proper recursive descent parser for the content?
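To make the two-phase idea concrete, here is a rough sketch of what I have in mind. Everything in it is made up for illustration: the `TEMPLATES` table, the function names, and the tiny subset of markup (parameterless templates plus bold and italics). It ignores the real apostrophe rules and every other corner case, so treat it as a picture of the shape, not a proposal for the actual code.

```python
import re

# Hypothetical template store; in reality templates live in wiki pages.
TEMPLATES = {
    "greet": "Hello, {{name}}!",
    "name": "'''world'''",
}

# Matches only innermost, parameterless calls such as {{name}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\}\}")

# Content tokens: bold marker, italic marker, plain text, stray apostrophe.
TOKEN_RE = re.compile(r"'''|''|[^']+|'")
TAGS = {"'''": "b", "''": "i"}


def expand_templates(text, depth=0, max_depth=10):
    """Phase 1: pattern-based substitution, repeated until nothing changes."""
    if depth > max_depth:
        return text  # stop rather than loop forever on self-referencing templates
    expanded = TEMPLATE_RE.sub(
        # Unknown templates are left untouched.
        lambda m: TEMPLATES.get(m.group(1).strip(), m.group(0)), text)
    if expanded == text:
        return text
    return expand_templates(expanded, depth + 1, max_depth)


def render_content(text):
    """Phase 2: recursive descent over the fully expanded text."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def parse_inline(closer):
        nonlocal pos
        out = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == closer:
                break  # the caller skips over the closing marker
            pos += 1
            if tok in TAGS:
                inner = parse_inline(tok)
                pos += 1  # skip the closer (harmless if the markup was unclosed)
                out.append(f"<{TAGS[tok]}>{inner}</{TAGS[tok]}>")
            else:
                out.append(tok)
        return "".join(out)

    return parse_inline(None)


def render(text):
    # Italics and bold are parsed exactly once, after all substitution is done.
    return render_content(expand_templates(text))
```

With this, `render("''{{greet}}''")` expands the templates first and only then looks at the apostrophes, so the first phase never wastes time on content markup.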
Unfortunately the deliberate apparent similarity of lots of very different language features ({{foo}} vs {{foo:blah}}, [[Project:Link]] vs [[Category:Link]] etc) makes much of this very complex.
I guess there's no possibility of making wholesale changes to the grammar then implementing a migration script?
4) All of this breaks a thousand different corner cases and half the
parser tests. The implementers carefully go through every failed parser test, rewrite it to the actual output, and carefully justify why this is the correct course of action. Or just assume it is, depending on the level of care.
Sounds good to me. I wonder also if there is any chance of implementing two parsers and migrating slowly from one to the next. Perhaps all Wikipedia pages starting with Ab... could be rendered with the new parser while others use the old? Pages using the new parser could have a warning displayed like "Are there problems with the way the content is displayed? Click here...". And wait for people to actually report perceived problems - as opposed to the page failing a regression test.
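Something like the following is what I mean by the gradual migration. The names (`pick_parser`, `render_page`, the `parsers` dict, the `parser-trial` CSS class) are all hypothetical, and the two parsers are just passed in as callables; this is only a sketch of the routing, assuming both parsers can live side by side.

```python
def pick_parser(title, new_parser_prefixes=("Ab",)):
    """Return 'new' for pages in the trial group, 'old' otherwise."""
    # str.startswith accepts a tuple, so the trial group can grow over time.
    return "new" if title.startswith(new_parser_prefixes) else "old"


def render_page(title, wikitext, parsers):
    """Render with whichever parser the title selects.

    `parsers` is assumed to map 'old' and 'new' to callables that take
    wikitext and return HTML.
    """
    choice = pick_parser(title)
    html = parsers[choice](wikitext)
    if choice == "new":
        # Ask readers to report problems instead of relying only on tests.
        html += ('\n<p class="parser-trial">Are there problems with the way '
                 'the content is displayed? Click here...</p>')
    return html
```

Widening the rollout would then just be a matter of adding prefixes to the tuple.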
5) A PHP implementation of the exact same grammar is written. How
practical this is, I don't know, but it's critical unless we want pretty substantially different behavior for people using the PHP module versus not. It is not acceptable to force third parties to use a PHP module, nor to grind their parser to a halt (which a naive compilation of the grammar into PHP would probably do).
Wasn't there a move to get away from PHP for the parser? Is that not feasible?
6) Everything is rolled out live. Pages break left and right. Large
complaint threads are started on the Village Pump, people fix it, and everyone forgets about it. Developers get a warm fuzzy feeling for having finally succeeded at destroying Parser.php.
I have trouble picturing this. It could be horrendous. But if it could be managed so there were perhaps a few dozen complaints a day and not more, that might be doable.
This is if it's to be done properly. A semi-formal specification
that's not directly useful for parsing pages would involve a lot less work and perhaps correspondingly less benefit. It could still improve interoperability with third parties dramatically; perhaps that's the only goal other people have in mind, not the ability to compile a parser with some yacc equivalent. I don't know.
The parser keeps moving, though. I don't see a semi-formal grammar that isn't used for anything keeping pace.
Steve