On 3 January 2011 21:54, Andreas Jonsson <andreas.jonsson(a)kreablo.se> wrote:
2010-12-29 08:33, Andrew Dunbar skrev:
I've thought a lot about this too. It
certainly is not any type of
standard grammar. But on the other hand it is a pretty common kind of
nonstandard grammar. I call it a "recursive text replacement grammar".
Perhaps this type of grammar has some useful characteristics we can
discover and document. It may be possible to follow the code flow and
document each text replacement in sequence as a kind of parser spec
rather than trying and failing again to shoehorn it into a standard
If it is possible to extract such a spec it would then be possible to
implement it in other languages.
Some research may even find that is possible to transform such a
grammar deterministically into an LALR grammar...
But even if not I'm certain it would demysitfy what happens in the
parser so that problems and edge cases would be easier to locate.
From my experience of implementing a wikitext parser, I would say that
it might be possible to transform wikitext to a token stream that is
possible to parse with a LALR parser. My implementation
uses Antlr (which is an LL parser generator) and only rely on context
sensitive parsing (Antlr's semantic predicates) for parsing
apostrophes (bold and italics), and this might be possible to solve in
a different way. The rest of the complex cases are handled by the
lexical analyser that produce a well behaving token stream that can be
relatively straightforwardly parsed.
My implementation is not 100% compatible, but I think that a 100%
compatible parser is not desirable since the most exotic border cases
would probably be characterized as "bugs" anyway (e.g. [[Link|<table
class="]]">). But I think that the basic idea can be used to produce
a sufficiently compatible parser.
In that case what is needed is to hook your parser into our current code
and get it create output if you have not done that already. Then you will
want to run the existing parser tests on it. Then you will want to run both
parsers over a large sample of existing Wikipedia articles (make sure
you use the same revisions on both parsers!) and run them through diff.
Then we'll have a decent idea of whether there are any edge cases you
didn't spot or whether any of them are exploited in template magic.
Let us know the results!
Andrew Dunbar (hippietrail)
Wikitech-l mailing list