Eric Astor wrote:
I've actually been working on a Python-based wikitext parser, using some techniques that should make the system a bit faster and cleaner... With a lot of luck, I should start making progress on that again in the next month or so.
For anyone who cares, I'll probably be trying to implement a PEG-based parser using mxTextTools, since I think that should be able to parse all of MediaWiki's wikitext, and should be about twice as fast as the current Parser.php (which is about as fast as wiki2xml)... Or I might just end up using ANTLR, if I can bully my current semi-grammar into working in that framework... If anyone knows of a decent PEG parser with a Python API (a packrat parser might be ideal), that'd be great too. *shrugs*
- Eric Astor
Yes! I also believe that PEGs and [[packrat parser]]s are the way to go for parsing wikitext, given its very ad-hoc definition.
A basic packrat parser is pretty easy to implement; it's simply a brute-force recursive-descent parser with memoization of (offset, term) -> production mappings. Scheme is a pretty good language to write a packrat parser in, since the grammar itself can be written as an S-expression, which makes it easy to transform programmatically (see below).
A simple implementation just interprets the grammar tree, matching as it goes.
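To make that concrete, here's a minimal sketch in Python (the tuple-based grammar representation is invented for the example, and it memoizes only the end offset of each match; a real parser would memoize the parsed production for each (offset, rule) pair, as above):

    # Grammar nodes: ("lit", s) literal string, ("seq", [e...]) sequence,
    # ("choice", [e...]) ordered PEG choice, ("rule", name) rule reference.
    def parse(grammar, start, text):
        memo = {}  # (offset, rule name) -> end offset of match, or None

        def match(expr, pos):
            kind = expr[0]
            if kind == "lit":
                lit = expr[1]
                return pos + len(lit) if text.startswith(lit, pos) else None
            if kind == "seq":
                for sub in expr[1]:
                    pos = match(sub, pos)
                    if pos is None:
                        return None
                return pos
            if kind == "choice":
                for sub in expr[1]:
                    end = match(sub, pos)
                    if end is not None:
                        return end
                return None
            if kind == "rule":
                key = (pos, expr[1])
                if key not in memo:  # the memoization is what makes it packrat
                    memo[key] = match(grammar[expr[1]], pos)
                return memo[key]

        return match(("rule", start), 0)

For example:

    grammar = {"greeting": ("seq", [("lit", "hello "),
                                    ("choice", [("lit", "world"),
                                                ("lit", "wiki")])])}
    parse(grammar, "greeting", "hello wiki")  # -> 10
    parse(grammar, "greeting", "goodbye")     # -> None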
You can achieve considerable speedups by:
1 using the grammar to generate code, then compiling and executing that instead of interpreting the grammar tree at runtime
2 allowing the grammar to contain both PEG expressions and regexps, so that low-level lexical matching is handed to the regexp engine (see the sketch after this list): regexps will be at least an order of magnitude faster than even compiled PEGs for matching tokens like numbers and names, and you keep the PEG's ability to blur the distinction between lexical and syntactic analysis, which is important for parsing strange things like wikitext.
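Point 2 might look like this, as a hypothetical ("re", compiled_pattern) node type for the interpreter sketched above:

    import re

    # Low-level lexical tokens matched by the regexp engine rather than
    # by character-at-a-time PEG rules (node type invented for the sketch).
    NUMBER = ("re", re.compile(r"[0-9]+"))
    NAME = ("re", re.compile(r"[A-Za-z_][A-Za-z0-9_]*"))

    def match_re(expr, text, pos):
        m = expr[1].match(text, pos)  # anchored match starting at pos
        return m.end() if m else None

Dispatching to match_re from an extra "re" case in match() keeps the whole grammar in one tree, with the lexical/syntactic boundary drawn wherever it happens to be convenient.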
I've implemented packrat parsing in both Python and Scheme: Scheme was faster, and ultimately more natural.
The one awkward bit is left-recursion removal: left recursion breaks packrat parsers unless you alter the grammar to an equivalent form without it. I did it by hand on my input grammars, but it could easily be done programmatically at grammar-generation time.
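For direct left recursion the rewrite is mechanical: A <- A a / b recognizes the same strings as A <- b a*. Here's a sketch of doing it at grammar-generation time, assuming rules are stored as lists of alternatives with each alternative a list of symbols, and a hypothetical ("star", ...) repetition node (indirect left recursion takes more work):

    # Rewrite  A <- A a1 / ... / A an / b1 / ... / bm
    # into     A <- (b1 / ... / bm) (a1 / ... / an)*
    def remove_direct_left_recursion(name, alternatives):
        tails = [alt[1:] for alt in alternatives if alt[:1] == [name]]
        bases = [alt for alt in alternatives if alt[:1] != [name]]
        if not tails:
            return {name: alternatives}  # rule isn't left-recursive
        repeat = ("star", ("choice", tails))
        return {name: [base + [repeat] for base in bases]}

(To keep left-associative parse trees you then fold the repeated tail back over the base after matching.)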
I'm not sure about the best way to implement an API. Have you considered just using the parser to convert from wikitext to something like PYX, which is a very simple-to-parse, Python-friendly representation of an XML data structure? The PYX stream can then be used to build an in-core parse tree, to drive something like a SAX API, or to feed whatever other post-processing you like (for example, direct procedural text-to-text generation, which could be very simple indeed, since the output of a successful parse is guaranteed by definition to _exactly_ conform to the grammar specification).
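PYX is just a line-oriented rendering of the start-tag/end-tag/character-data event stream, so consuming it from Python is nearly trivial. A minimal reader, going by the line codes described in the article linked below:

    # PYX line codes: "(" start tag, ")" end tag, "A" attribute
    # (name, space, value), "-" character data (newlines escaped as
    # "\n"), "?" processing instruction.
    def pyx_events(lines):
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            code, rest = line[0], line[1:]
            if code == "(":
                yield "start", rest
            elif code == ")":
                yield "end", rest
            elif code == "A":
                name, _, value = rest.partition(" ")
                yield "attr", name, value
            elif code == "-":
                yield "text", rest.replace("\\n", "\n")
            elif code == "?":
                yield "pi", rest

Building an in-core tree, or driving SAX-style handlers, is then a small loop over those events.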
-- Neil
[PYX: http://www.xml.com/pub/a/2000/03/15/feature/index.html]