I've created a page for testing and debugging our parser: http://meta.wikipedia.org/wiki/Parser_testing
It's far from complete, but anyone with a knowledge of our syntax can add to it and I hope it grows into something useful. So if you want to help debugging the software but don't know how, please contribute here.
There should be one section for each test. A test can consist of a variety of individual cases, all of which should be added first as <pre>..</pre>, then verbatim, so that the difference between user input and parser output can be seen.
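As a rough sketch of that layout, the following helper builds the wikitext for one test section, emitting each case first inside <pre>..</pre> and then verbatim. The function name, the "== Title ==" heading, and the blank-line spacing are assumptions made for illustration; the page itself does not prescribe them.

```python
def render_test_section(title, cases):
    """Build wikitext for one test section (hypothetical helper).

    Each case appears twice: once inside <pre>..</pre> so the raw
    input stays visible, then verbatim so the parser renders it.
    """
    lines = ["== " + title + " ==", ""]
    for case in cases:
        # Raw user input, shown unparsed.
        lines.extend(["<pre>", case, "</pre>", ""])
        # The same input again, left for the parser to render.
        lines.extend([case, ""])
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_test_section("Mixed lists", ["* a\n*# b\n* c"]))
```

Comparing the two copies of each case on the rendered page shows the difference between input and parser output at a glance.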
One bug I noticed is that mixed lists seem to no longer work, unless I'm missing something. I also think that we shouldn't use stuff like "left" and "50px" as image alt text.
It's not always obvious what we *want* our behavior to be. Right now I'm assuming that everything the parser does is correct unless the result is really strange.
Kudos to Jens for the image caption parsing code; it seems to grok almost anything.
Regards,
Erik
Erik Moeller wrote:
anyone with a knowledge of our syntax
Has anyone ever written a context-free grammar to describe our syntax?
I think if I can be sufficiently bothered, I'll do that after the exams. Then someone else can write a recursive-descent parser from it.
I also think that we shouldn't use stuff like "left" and "50px" as image alt text.
For obvious reasons, the software cannot know if "left"/"50px" is actually supposed to be the alt text. I think our syntax works so that for [[Image:Filename.ext|50px]], this is correct behaviour. To actually make an image 50px wide, it should be [[Image:Filename.ext|50px|]], and that should generate an empty alt text.
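A minimal sketch of the rule described above, assuming the last pipe-separated field is always taken as the alt text. The function name is hypothetical and this is not the actual parser code, only an illustration of the behaviour being argued for.

```python
def image_alt_text(link):
    """Return the alt text for an [[Image:...]] link under the
    'last field is the alt text' rule (an assumption, see above).

    Returns None when the link has no fields after the filename.
    """
    if not (link.startswith("[[") and link.endswith("]]")):
        raise ValueError("not a wiki link: " + link)
    parts = link[2:-2].split("|")
    # parts[0] is the target, e.g. "Image:Filename.ext".
    if len(parts) == 1:
        return None
    # A trailing "|" yields an empty last field, i.e. empty alt text.
    return parts[-1]


if __name__ == "__main__":
    for sample in ("[[Image:Filename.ext|50px]]",
                   "[[Image:Filename.ext|50px|]]"):
        print(sample, "->", repr(image_alt_text(sample)))
```

Under this rule the parser never has to guess whether "50px" is a size or a caption; the trailing pipe makes the intent explicit.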
Timwi
On Wed, Jun 02, 2004 at 12:43:03PM +0100, Timwi wrote:
Has anyone ever written a context-free grammar to describe our syntax?
I had been playing with a flex/bison combo, but Wiki's not suited to LALR; should be trivial to switch out the lexer for something in-house, though.
Bison, on the other hand, seemed to do the trick.
~Peter
Has anyone ever written a context-free grammar to describe our syntax?
I had been playing with a flex/bison combo, but Wiki's not suited to LALR; should be trivial to switch out the lexer for something in-house, though.
Bison, on the other hand, seemed to do the trick.
As I'm working on a Wikipedia-related project that involves parsing articles, this really interests me. I've been thinking of lex/bison too, but haven't done it yet :) So lex isn't that great? Could you send the grammars you defined?
~Peter
Nicolas 'Ryo'
In addition to what Nicolas Weeger already requested, might I also ask exactly what syntax element you found unsuitable? I can't think of anything that isn't context-free.
What I mean is, Timwi, while WikiML might be reducible to a context-free grammar, it's not necessarily suited to one-token lookahead parsing, where Bison excels.
Like SGML and HTML, however, WikiML is describable in terms of a DTD; and DTD is written in Extended Backus-Naur Form. In spite of EBNF, WikiML's features which distinguish it from XML might make constructing a legitimate context-free grammar tedious; on the other hand, many of the points brought against SGML's context freedom may or may not apply to Wiki [1]:
SGML                    WikiML
--------------------------------------------------------
Declared content     -> Initial spaces
Inclusion exclusions -> Nested definitions
AND groups           -> Link morphology
OMITTAG              -> Section and subsections
--------------------------------------------------------
Constructing a rigorous WikiML DTD would enable us to erect an LL(1) grammar at will, and should be our first job; in addition, HTML Tidy's lexer/parser provides a fantastic example of recursive-descent application.
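To make the recursive-descent idea concrete, here is a small sketch for a toy grammar covering only plain text and [[target|label]] links. The grammar, the token set, and all names are assumptions chosen for illustration; they are not a proposed WikiML grammar and say nothing about whether the full syntax fits one-token lookahead.

```python
import re

# Toy grammar (illustrative assumption, not the real wiki syntax):
#   document := (link | TEXT)*
#   link     := "[[" field ("|" field)* "]]"
TOKEN_RE = re.compile(r"\[\[|\]\]|\||[^\[\]|]+|[\[\]]")


class Parser:
    def __init__(self, source):
        self.tokens = TOKEN_RE.findall(source)
        self.pos = 0

    def peek(self):
        # One token of lookahead is all this toy grammar needs.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_document(self):
        nodes = []
        while self.peek() is not None:
            if self.peek() == "[[":
                nodes.append(self.parse_link())
            else:
                nodes.append(("text", self.advance()))
        return nodes

    def parse_link(self):
        self.advance()  # consume "[["
        fields = [""]
        while self.peek() not in ("]]", None):
            tok = self.advance()
            if tok == "|":
                fields.append("")  # start the next field
            else:
                fields[-1] += tok
        if self.peek() is None:
            raise SyntaxError("unterminated link")
        self.advance()  # consume "]]"
        return ("link", fields)


def parse(source):
    return Parser(source).parse_document()


if __name__ == "__main__":
    print(parse("See [[Main Page|home]] now"))
```

Each grammar rule maps to one method, and every decision is made by inspecting a single upcoming token, which is what makes this toy grammar LL(1). Raising an error on an unterminated link is a simplification; a wiki parser would more likely fall back to treating the text literally.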
Best,
Peter

-----------
[1] See Joe English' classic discussion of SGML's context freedom, which he still considered an open question:

"It is however possible to create an equivalent context-free (BNF) grammar from any XML DTD, where the terminals are start-tags, end-tags, and #PCDATA and the productions correspond to element types and content model positions. The reverse is not possible in general, so XML DTDs are (in a sense) a subset of BNF.

"The same *might* be true of SGML, but when you consider things like declared content, exceptions, AND groups, and OMITTAG, converting a general SGML DTD to a CFG is a much more difficult problem. "Do SGML DTDs define context-free languages" is still an open question AFAIK. (I suspect the answer is "yes", but even if so it would not be a very useful result; the derived CFG would be intractably large in many cases.)"

(http://xml.coverpages.org//english-cfg.html)
wikitech-l@lists.wikimedia.org