[Wikitech-l] Wikitext parsing (was RE: flexbisonparse)

28 Aug 2006


      Neil Harris wrote:
...
> Yes! I also believe that PEGs and [[packrat parser]]s are the
> way to go with parsing wikitext, because of the very ad-hoc
> definition of wikitext.
Absolutely agreed. I only wish PEGs could support backreference matches, as
it would clean up list, allowed HTML, and extension handling. In fact, I'm
not quite sure how to handle lists without backreferences.
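
One way to approximate a backreference in a recursive-descent setting is to pass the already-matched list prefix down as a rule parameter, so that each nested rule only accepts markers that extend it. The sketch below is illustrative only: `ITEM`, `parse_items`, and the (marker, text, children) tuple shape are hypothetical names, not part of any existing wikitext parser, and it ignores details such as skipped nesting levels and the special handling of definition lists.

```python
import re

# A list line: one or more marker characters, optional space, then the text.
ITEM = re.compile(r"([*#:;]+)\s*(.*)")


def parse_items(lines, pos, prefix):
    """Parse consecutive list items nested exactly one level below `prefix`.

    Returns (items, next_pos). Each item is (marker, text, children).
    The `prefix` argument plays the role a backreference would play in
    the grammar: a line only belongs here if its marker repeats `prefix`
    and adds exactly one more marker character.
    """
    items = []
    while pos < len(lines):
        m = ITEM.match(lines[pos])
        if m is None:
            break
        marker, text = m.group(1), m.group(2)
        if len(marker) != len(prefix) + 1 or not marker.startswith(prefix):
            break
        # Deeper items directly below this one become its children.
        children, pos = parse_items(lines, pos + 1, marker)
        items.append((marker, text, children))
    return items, pos


example = ["* a", "** b", "** c", "* d"]
tree, consumed = parse_items(example, 0, "")
```

Here `tree` holds two top-level items, the first of which carries the two `**` items as children. A plain PEG has no way to express the `prefix` parameter, which is the gap described above.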
...
> You can achieve considerable speedups by:
> 1 using the grammar to generate code, and compiling and
> executing that instead of interpreting the grammar by hand
Definitely - come to think of it, I bet this could be done VERY nicely with
Python. Or most other sufficiently self-exposed languages... Hm.
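
As a rough illustration of the code-generation idea in Python, the sketch below turns a tiny grammar into Python source, compiles it once, and then calls the generated functions directly. Everything here is an assumption made for the example: the tuple-based grammar encoding (`lit`, `ref`, `seq`, `alt`, `star`), the `generate` function, and the convention that a generated matcher returns the new position or -1. It only recognizes input; it builds no parse tree and does no memoization.

```python
import itertools


def generate(grammar):
    """Return Python source defining one function per (sub)expression."""
    out = []
    counter = itertools.count()

    def emit(expr, name=None):
        name = name or "_e%d" % next(counter)
        kind = expr[0]
        if kind == "lit":
            body = ["return i + %d if s.startswith(%r, i) else -1"
                    % (len(expr[1]), expr[1])]
        elif kind == "ref":
            body = ["return rule_%s(s, i)" % expr[1]]
        elif kind == "seq":
            body = []
            for sub in expr[1:]:
                body += ["i = %s(s, i)" % emit(sub), "if i < 0: return -1"]
            body.append("return i")
        elif kind == "alt":  # ordered choice: first success wins
            body = []
            for sub in expr[1:]:
                body += ["j = %s(s, i)" % emit(sub), "if j >= 0: return j"]
            body.append("return -1")
        elif kind == "star":  # stop on failure or on an empty match
            sub = emit(expr[1])
            body = ["while True:",
                    "    j = %s(s, i)" % sub,
                    "    if j <= i: return i",
                    "    i = j"]
        else:
            raise ValueError("unknown expression kind: %r" % (kind,))
        out.append("def %s(s, i):\n    %s\n" % (name, "\n    ".join(body)))
        return name

    for rule, expr in grammar.items():
        emit(expr, "rule_" + rule)
    return "\n".join(out)


GRAMMAR = {
    "digit": ("alt", ("lit", "0"), ("lit", "1")),
    "number": ("seq", ("ref", "digit"), ("star", ("ref", "digit"))),
    "sum": ("seq", ("ref", "number"),
            ("star", ("seq", ("lit", "+"), ("ref", "number")))),
}

source = generate(GRAMMAR)
namespace = {}
exec(compile(source, "<generated>", "exec"), namespace)
rule_sum = namespace["rule_sum"]  # rule_sum(text, start) -> end or -1
```

The generated functions are ordinary Python code, so the per-match cost of walking the grammar data structure is paid once, at generation time.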
...
> 2 allowing the grammar to contain both PEG expressions and
> regexps for low-level lexical matching: regexps will be at
> least an order of magnitude faster than even compiled PEGs
> for matching low-level lexical tokens like numbers and names,
> without removing the ability of PEGs to blur the distinction
> between lexical and syntactic analysis, which is important
> for parsing strange things like wikitext.
This sounds like a great idea for extended PEGs anyway... I'll remember that
if I end up building an mxTextTools frontend for PEGs, since mxTextTools can
easily hook into arbitrary matching functions (including regex).
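
A minimal sketch of that mix, using plain Python closures rather than mxTextTools: every matcher is just a function from (text, pos) to a new position or None, so a regex-backed terminal and a PEG operator are interchangeable. The combinator names (`regex`, `seq`, `choice`, `many`) and the sample rules are made up for illustration and say nothing about how an mxTextTools frontend would actually be structured.

```python
import re


def regex(pattern):
    """Terminal backed by a compiled regex, anchored at `pos`."""
    compiled = re.compile(pattern)

    def match(text, pos):
        m = compiled.match(text, pos)
        return m.end() if m else None
    return match


def seq(*parts):
    """PEG sequence: every part must match in order."""
    def match(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return match


def choice(*alts):
    """PEG ordered choice: the first alternative that matches wins."""
    def match(text, pos):
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end
        return None
    return match


def many(part):
    """PEG zero-or-more; stops on failure or on an empty match."""
    def match(text, pos):
        while True:
            end = part(text, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return match


# Low-level tokens are regexes; the structure above them is PEG.
NUMBER = regex(r"\d+(?:\.\d+)?")
NAME = regex(r"[A-Za-z_]\w*")
WS = regex(r"\s*")
atom = choice(NUMBER, NAME)
arg_list = seq(atom, many(seq(WS, regex(","), WS, atom)))

end = arg_list("x, 42, y_1", 0)
```

Because a terminal is only a callable, any other matching function with the same signature could be dropped in where `regex(...)` is used.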
...
> I've implemented packrat parsing in both Python and Scheme:
> Scheme was faster, and ultimately more natural.
That's quite possible - the problem would be that I don't know Scheme, and I
am going to be extremely busy for the foreseeable future at school. I'd
rather not have to write a packrat parser myself, anyway... However simple
they may be, they improve drastically with optimizations, and I don't
anticipate having the time to implement a proper system.
Unless a good Python-accessible packrat parser already exists, I'm most
likely to just build a solid PEG frontend for mxTextTools. It's a very
powerful text parser, and tends to be fast (the module's mostly written in
C). I think it could easily support all PEG features. Actually, I think
SimpleParse (another mxTextTools frontend) already supports at least 90% of
PEG features, so maybe the best idea is simply to rework SimpleParse to use
standard PEG syntax instead of its extremely extended BNF variant.
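
For reference, the core of packrat parsing is small: a recursive-descent PEG parser whose rule results are memoized per (rule, position), so no rule body runs twice at the same input offset. The sketch below is a toy recognizer written under that assumption; the `Packrat` class, its `apply` method, and the two-rule grammar are hypothetical and unrelated to mxTextTools or SimpleParse, and none of the optimizations mentioned above are attempted.

```python
class Packrat:
    """Toy packrat recognizer for:
         expr <- term "+" expr / term
         term <- "(" expr ")" / digit
    """

    def __init__(self, text):
        self.text = text
        self.memo = {}  # (rule name, position) -> end position or None

    def apply(self, rule, pos):
        """Run `rule` at `pos` at most once, then answer from the memo."""
        key = (rule, pos)
        if key not in self.memo:
            self.memo[key] = getattr(self, rule)(pos)
        return self.memo[key]

    def expr(self, pos):
        end = self.apply("term", pos)
        if end is not None and self.text.startswith("+", end):
            rest = self.apply("expr", end + 1)
            if rest is not None:
                return rest
        # Second alternative: the term at `pos` is already memoized,
        # so backtracking here costs a dictionary lookup.
        return self.apply("term", pos)

    def term(self, pos):
        if self.text.startswith("(", pos):
            end = self.apply("expr", pos + 1)
            if end is not None and self.text.startswith(")", end):
                return end + 1
        if pos < len(self.text) and self.text[pos].isdigit():
            return pos + 1
        return None


parser = Packrat("(1+2)+3")
end = parser.apply("expr", 0)  # end position of the match, or None
```

The memo table is what buys linear time, and also what costs memory proportional to the input length times the number of rules.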
...
> I'm not sure about the best way to implement an API: have you
> considered just using the parser to convert from wikitext to
> something like PYX, which is a very simple-to-parse and
> Python-friendly representation of an XML data structure...
Something like that would probably be ideal, although I'd tend to prefer a
more abstract data structure that's programmatically accessible - maybe an
mxTextTools tag list (its normal output format) is closer to what I mean.
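
The two representations are close enough that one can be derived from the other. The sketch below assumes a tag list shaped like mxTextTools output, that is, nested (tag, left, right, subtags) tuples indexing into the source text, and flattens it to PYX-style lines ("(" opens an element, "-" carries character data, ")" closes it). The function name `taglist_to_pyx` and the sample tags are invented for the example; attributes are omitted, text lying between sibling subtags is dropped, and the newline escaping shown is an assumption about PYX conventions.

```python
def taglist_to_pyx(text, taglist):
    """Yield PYX-style lines for nested (tag, left, right, subtags) tuples."""
    for tag, left, right, subtags in taglist:
        yield "(" + tag
        if subtags:
            # Inner node: recurse into the children.
            yield from taglist_to_pyx(text, subtags)
        else:
            # Leaf: emit the matched slice as character data.
            data = text[left:right]
            yield "-" + data.replace("\\", "\\\\").replace("\n", "\\n")
        yield ")" + tag


text = "Hello world"
taglist = [
    ("paragraph", 0, 11, [
        ("word", 0, 5, None),
        ("word", 6, 11, None),
    ]),
]
pyx_lines = list(taglist_to_pyx(text, taglist))
```

The tag list keeps offsets into the original text, which the line-oriented form throws away; that is the sense in which it is the more programmatically accessible of the two.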
- Eric Astor
mxTextTools: http://www.egenix.com/files/python/mxTextTools.html
SimpleParse: http://simpleparse.sourceforge.net/simpleparse_grammars.html
