On 11/7/07, Steve Bennett stevagewp@gmail.com wrote:
What exactly is the "goal"? If it's just "formally defining whatever it is that the code currently does", is that a worthy goal?
Probably not. The best we can hope for is likely something like:
1) A BNF grammar is developed that covers almost all of the commonly used features. This will probably require unlimited lookahead, but I do think (without, admittedly, much of any formal grounding in the theory of all this) it's possible if that's allowed, keeping in mind the "almost all" caveat.
2) Now that we have a grammar, a yacc parser is compiled, and appropriate rendering bits are added to get it to render to HTML.
3) The stuff the BNF grammar doesn't cover is tacked on by other means. In practice, it seems like a two-pass parser would be ideal: one recursive pass to deal with templates and other substitution-type things, then a second pass that applies the actual grammar for most of the language. The first pass is of necessity recursive, so there's probably no point in having it parse italics or whatever at each level of substitution, when that text will just have to be parsed again once it's been substituted in. Further rendering passes are going to be needed, e.g., to insert the table of contents. Further parsing passes may or may not be needed.
4) All of this breaks a thousand different corner cases and half the parser tests. The implementers carefully go through every failed parser test, rewrite it to the actual output, and carefully justify why this is the correct course of action. Or just assume it is, depending on the level of care.
5) The exact same grammar is implemented in PHP. How practical this is, I don't know, but it's critical unless we want pretty substantially different behavior for people who use the PHP module versus people who don't. It is not acceptable to force third parties to use a PHP module, nor to grind their parser to a halt (which a naive compilation of the grammar into PHP would probably do).
6) Everything is rolled out live. Pages break left and right. Large complaint threads are started on the Village Pump, people fix it, and everyone forgets about it. Developers get a warm fuzzy feeling for having finally succeeded at destroying Parser.php.
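To make the two-pass idea in step 3 concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the TEMPLATES table, the function names, the depth limit, and the use of a single regex for italics as a stand-in for the real grammar pass. It is not how Parser.php works, and it ignores template parameters entirely.

```python
import re

# Hypothetical template store; a real wiki would load these from the database.
TEMPLATES = {
    "greeting": "Hello, ''{{name}}''!",
    "name": "world",
}

TEMPLATE_RE = re.compile(r"\{\{([^{}]+)\}\}")
ITALIC_RE = re.compile(r"''(.+?)''")


def expand_templates(text, depth=0, max_depth=40):
    """Pass 1: recursively substitute {{name}} with the template body.

    No markup is parsed here; the substituted text is left for pass 2.
    The depth limit is an assumed guard against template loops.
    """
    if depth > max_depth:
        return text

    def substitute(match):
        body = TEMPLATES.get(match.group(1).strip())
        if body is None:
            return match.group(0)  # unknown template: leave it as written
        return expand_templates(body, depth + 1, max_depth)

    return TEMPLATE_RE.sub(substitute, text)


def render_inline(text):
    """Pass 2: stand-in for the grammar-driven pass; handles italics only."""
    return ITALIC_RE.sub(r"<i>\1</i>", text)


def render(text):
    # Italics are parsed exactly once, after all substitution is finished.
    return render_inline(expand_templates(text))
```

Calling render("{{greeting}}") expands both templates first and only then looks at the italics, so the markup is never parsed more than once no matter how deeply the templates nest.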
This is if it's to be done properly. A semi-formal specification that's not directly useful for parsing pages would involve a lot less work and perhaps correspondingly less benefit. It could still improve interoperability with third parties dramatically; perhaps that's the only goal other people have in mind, not the ability to compile a parser with some yacc equivalent. I don't know.
On 11/7/07, Steve Bennett stevagewp@gmail.com wrote:
Not to mention that BNF is not really suited to the task. BNF is supposed to answer the question "does text A match grammar B?" However, essentially all wikitext is "valid" - so we're really looking for something that answers the question "how should text A be rendered" or "what is the meaning of text A" or even "how should text A be converted into a decorated* syntax tree".
BNF does that. The *language* generated by a grammar is distinct from the grammar itself: two grammars can be different but generate the same language. In this case, the language might be the set of all strings, but applying the grammar to a string gets us a parse tree, which is what we want. Specifically, yacc and similar programs (e.g., bison) let you attach a code snippet to each rule of the grammar and will execute it every time that rule is matched in the input, as far as I gather. Those snippets should be able to include appending to an HTML output string.
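A small sketch of that point, in Python rather than yacc: a toy grammar that accepts every string but still produces a tree we can render. The grammar, the node names, and the functions parse and to_html are all invented for this example, and the parser is hand-written instead of generated, so take it as an illustration of the idea only.

```python
from html import escape

# Toy grammar (informal, assumed for this example):
#   text   ::= item*
#   item   ::= italic | CHAR
#   italic ::= "''" CHAR* "''"
# An unclosed '' is not an error; it simply falls back to plain CHARs,
# so every input string is in the language.


def parse(src):
    """Return a flat parse tree: a list of ("text", s) / ("italic", s) nodes."""
    nodes = []
    i = 0
    while i < len(src):
        if src.startswith("''", i):
            end = src.find("''", i + 2)
            if end != -1:
                nodes.append(("italic", src[i + 2:end]))
                i = end + 2
                continue
        # Plain character: merge it into the previous text node if there is one.
        if nodes and nodes[-1][0] == "text":
            nodes[-1] = ("text", nodes[-1][1] + src[i])
        else:
            nodes.append(("text", src[i]))
        i += 1
    return nodes


def to_html(nodes):
    """Play the role of the per-rule actions: append HTML for each node."""
    parts = []
    for kind, value in nodes:
        if kind == "italic":
            parts.append("<i>" + escape(value, quote=False) + "</i>")
        else:
            parts.append(escape(value, quote=False))
    return "".join(parts)
```

Here parse("''oops") does not fail; it yields a single text node, while parse("a ''b'' c") yields text, italic, text. Both inputs are "valid", but the trees differ, and the tree is what decides the rendering.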