Neil Harris wrote:
Yes! I also believe that PEGs and [[packrat parser]]s are the way to go with parsing wikitext, because of the very ad-hoc definition of wikitext.
Absolutely agreed. I only wish PEGs could support backreference matches, as it would clean up list, allowed HTML, and extension handling. In fact, I'm not quite sure how to handle lists without backreferences.
You can achieve considerable speedups by:
1. using the grammar to generate code, and compiling and executing that instead of interpreting the grammar by hand
Definitely - come to think of it, I bet this could be done VERY nicely with Python. Or most other sufficiently self-exposed languages... Hm.
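A minimal sketch of what "generate code from the grammar" could look like in Python. Everything here is an assumption made for illustration: the tuple-based grammar encoding ("lit", "seq", "alt", "star", "ref"), the compile_grammar helper, and the toy bold rule are all hypothetical, not an existing library or real wikitext grammar. Each rule is turned into the source of a plain Python function, and the result is passed to exec once.

```python
import itertools

def compile_grammar(grammar):
    """Translate a grammar dict into Python source, exec it, and return both."""
    counter = itertools.count()
    lines = []

    def emit(expr, src, dst, ind):
        # Emit code that reads position `src` and leaves the new position
        # (or None on failure) in the variable `dst`.
        pad = "    " * ind
        kind = expr[0]
        if kind == "lit":
            text = expr[1]
            lines.append(f"{pad}{dst} = {src} + {len(text)} "
                         f"if s.startswith({text!r}, {src}) else None")
        elif kind == "ref":
            lines.append(f"{pad}{dst} = parse_{expr[1]}(s, {src})")
        elif kind == "seq":
            lines.append(f"{pad}{dst} = {src}")
            for sub in expr[1:]:
                lines.append(f"{pad}if {dst} is not None:")
                emit(sub, dst, dst, ind + 1)
        elif kind == "alt":
            # PEG ordered choice: try each alternative from the same start.
            tmp = f"t{next(counter)}"
            lines.append(f"{pad}{tmp} = {src}")
            lines.append(f"{pad}{dst} = None")
            for sub in expr[1:]:
                lines.append(f"{pad}if {dst} is None:")
                emit(sub, tmp, dst, ind + 1)
        elif kind == "star":
            # Greedy repetition; stops on failure or on an empty match.
            tmp = f"t{next(counter)}"
            lines.append(f"{pad}{dst} = {src}")
            lines.append(f"{pad}while True:")
            emit(expr[1], dst, tmp, ind + 1)
            lines.append(f"{pad}    if {tmp} is None or {tmp} == {dst}:")
            lines.append(f"{pad}        break")
            lines.append(f"{pad}    {dst} = {tmp}")
        else:
            raise ValueError(f"unknown expression kind: {kind}")

    for name, expr in grammar.items():
        lines.append(f"def parse_{name}(s, i):")
        emit(expr, "i", "r", 1)
        lines.append("    return r")

    source = "\n".join(lines)
    namespace = {}
    exec(source, namespace)  # compile the generated parser once
    return source, namespace

# Toy grammar (hypothetical): bold <- "'''" word "'''"
GRAMMAR = {
    "bold": ("seq", ("lit", "'''"), ("ref", "word"), ("lit", "'''")),
    "word": ("seq", ("ref", "letter"), ("star", ("ref", "letter"))),
    "letter": ("alt", ("lit", "a"), ("lit", "b"), ("lit", "c")),
}

source, parsers = compile_grammar(GRAMMAR)
end = parsers["parse_bold"]("'''abc'''", 0)  # end position, or None on failure
```

Printing source shows the generated functions, which is a handy way to check what the generator actually produced.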
2. allowing the grammar to contain both PEG expressions and regexps for low-level lexical matching: regexps will be at least an order of magnitude faster than even compiled PEGs for matching low-level lexical tokens like numbers and names, without removing the ability of PEGs to blur the distinction between lexical and syntactic analysis, which is important for parsing strange things like wikitext.
This sounds like a great idea for extended PEGs anyway... I'll remember that if I end up building an mxTextTools frontend for PEGs, since mxTextTools can easily hook into arbitrary matching functions (including regex).
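As a rough illustration of mixing the two, here is a sketch in plain Python where the terminals are compiled regexps and the structure is ordinary PEG sequence and ordered choice. The combinator names (regex, seq, choice) and the simplified internal-link rule are assumptions for the example; this is not mxTextTools or SimpleParse code.

```python
import re

def regex(pattern):
    """Terminal: match a compiled regexp at position i."""
    compiled = re.compile(pattern)
    def parse(s, i):
        m = compiled.match(s, i)
        return (m.group(), m.end()) if m else None
    return parse

def seq(*parsers):
    """PEG sequence: every part must match, one after the other."""
    def parse(s, i):
        values = []
        for p in parsers:
            result = p(s, i)
            if result is None:
                return None
            value, i = result
            values.append(value)
        return values, i
    return parse

def choice(*parsers):
    """PEG ordered choice: the first alternative that matches wins."""
    def parse(s, i):
        for p in parsers:
            result = p(s, i)
            if result is not None:
                return result
        return None
    return parse

# Simplified, hypothetical rule for an internal link:
#   link <- "[[" name "|" name "]]" / "[[" name "]]"
name = regex(r"[^\[\]|]+")
link = choice(
    seq(regex(r"\[\["), name, regex(r"\|"), name, regex(r"\]\]")),
    seq(regex(r"\[\["), name, regex(r"\]\]")),
)

plain = link("[[packrat parser]]s", 0)
piped = link("[[PEG|grammar]]", 0)
```

Each parser returns a (value, next_position) pair or None, so a regexp terminal and a PEG rule are interchangeable wherever a parser is expected.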
I've implemented packrat parsing in both Python and Scheme: Scheme was faster, and ultimately more natural.
That's quite possible - the problem would be that I don't know Scheme, and I am going to be extremely busy for the foreseeable future at school. I'd rather not have to write a packrat parser myself, anyway... However simple they may be, they improve drastically with optimizations, and I don't anticipate having the time to implement a proper system.
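For reference, the core of a packrat parser is small: a recursive-descent PEG parser plus a memo table keyed on (rule, position). The sketch below assumes the usual textbook arithmetic grammar rather than anything wikitext-specific, and the class and method names are made up for the example. It has none of the optimizations mentioned above.

```python
class Packrat:
    """Recursive-descent PEG parser with a (rule, position) memo table."""

    DIGITS = "0123456789"

    def __init__(self, text):
        self.text = text
        self.memo = {}

    def apply(self, rule, pos):
        # Each rule is evaluated at most once per input position.
        key = (rule, pos)
        if key not in self.memo:
            self.memo[key] = getattr(self, rule)(pos)
        return self.memo[key]

    # additive <- multitive '+' additive / multitive
    def additive(self, pos):
        left = self.apply("multitive", pos)
        if left is not None:
            value, p = left
            if self.text.startswith("+", p):
                right = self.apply("additive", p + 1)
                if right is not None:
                    return value + right[0], right[1]
        return self.apply("multitive", pos)  # second alternative, memo hit

    # multitive <- primary '*' multitive / primary
    def multitive(self, pos):
        left = self.apply("primary", pos)
        if left is not None:
            value, p = left
            if self.text.startswith("*", p):
                right = self.apply("multitive", p + 1)
                if right is not None:
                    return value * right[0], right[1]
        return self.apply("primary", pos)

    # primary <- '(' additive ')' / decimal
    def primary(self, pos):
        if self.text.startswith("(", pos):
            inner = self.apply("additive", pos + 1)
            if inner is not None and self.text.startswith(")", inner[1]):
                return inner[0], inner[1] + 1
        return self.apply("decimal", pos)

    # decimal <- [0-9]+
    def decimal(self, pos):
        end = pos
        while end < len(self.text) and self.text[end] in self.DIGITS:
            end += 1
        return (int(self.text[pos:end]), end) if end > pos else None

result = Packrat("2*(3+4)").apply("additive", 0)  # (value, end position)
```

The memo table is what keeps backtracking from re-parsing the same span: when the first alternative of a rule fails partway, the second alternative reuses the cached sub-results.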
Unless a good Python-accessible packrat parser already exists, I'm most likely to just build a solid PEG frontend for mxTextTools. It's a very powerful text parser, and tends to be fast (the module's mostly written in C). I think it could easily support all PEG features. Actually, I think SimpleParse (another mxTextTools frontend) already supports at least 90% of PEG features, so maybe the best idea is simply to rework SimpleParse to use standard PEG syntax instead of its extremely extended BNF variant.
I'm not sure about the best way to implement an API: have you considered just using the parser to convert from wikitext to something like PYX, which is a very simple-to-parse and Python-friendly representation of an XML data structure...
Something like that would probably be ideal, although I'd tend to prefer a more abstract data structure that's programmatically accessible - maybe an mxTextTools tag list (its normal output format) is closer to what I mean.
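To make the PYX suggestion concrete, here is a small sketch that serializes an ElementTree element as PYX lines: "(" opens an element, "A" carries an attribute, "-" carries character data, and ")" closes the element. The to_pyx and escape helpers are hypothetical names, and the sample markup is only a guess at what a wikitext parser might emit for a paragraph with a link.

```python
import xml.etree.ElementTree as ET

def escape(text):
    """Keep each PYX record on one line by escaping backslash, newline, tab."""
    return (text.replace("\\", "\\\\")
                .replace("\n", "\\n")
                .replace("\t", "\\t"))

def to_pyx(elem):
    """Yield PYX lines for an element and everything below it."""
    yield "(" + elem.tag
    for attr, value in elem.attrib.items():
        yield f"A{attr} {escape(value)}"
    if elem.text:
        yield "-" + escape(elem.text)
    for child in elem:
        yield from to_pyx(child)
        if child.tail:  # text that follows the child's closing tag
            yield "-" + escape(child.tail)
    yield ")" + elem.tag

# Hypothetical parser output for the wikitext "See [[packrat parser]]s."
tree = ET.fromstring('<p>See <a href="Packrat_parser">packrat parser</a>s.</p>')
pyx_lines = list(to_pyx(tree))
```

Because every record is a single line with a one-character type prefix, a consumer can process the output with nothing more than a loop over lines and a check of the first character.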
- Eric Astor
mxTextTools: http://www.egenix.com/files/python/mxTextTools.html
SimpleParse: http://simpleparse.sourceforge.net/simpleparse_grammars.html