I like the WikiRover, but one thing I think it needs is a formal parser for wikitext. I haven't looked at MediaWiki 1.3, but the last time I looked at the parser (1.1, maybe?) it wasn't actually a parser, but a bunch of regular expressions applied to the flat wiki text, with some hacks like replacing math sections with a unique placeholder string so they don't get clobbered. All of that makes repurposing it for other things difficult. As a side note, it also seems to make extending it difficult: it appears to be the reason (unless I'm missing something else) for limitations like "you can't have links inside of image captions", since regular expressions are less expressive than context-free grammars and so can't distinguish the ]] closing an internal link from the ]] closing the image.
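To illustrate the kind of problem I mean (this is a toy example, not MediaWiki's actual code): a single regex for image syntax has no way to know which ]] is the matching one, so a nested link truncates the match.

import re

# Naive regex for "[[Image:...]]": the non-greedy match stops at the
# FIRST "]]", which belongs to the nested link, not the image.
text = "[[Image:Foo.jpg|thumb|A [[link]] in the caption]]"
image_re = re.compile(r"\[\[Image:(.*?)\]\]")
print(image_re.search(text).group(1))
# -> 'Foo.jpg|thumb|A [[link'   (the caption is cut off)

A real parser tracks nesting, so this case is trivial for it.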
I've been thinking of doing it for a while, but my main hang-up, apart from lack of time, is the lack of a good parser generator for the full class of context-free grammars. Most require LALR(1) grammars, and maintaining the wikitext specification in that form, not to mention getting it there in the first place, would be a nightmare, since wikitext isn't designed with it in mind the way many programming languages are (it's hard to even scan wikitext with a lexer and unambiguously find terminals). One approach that both accepts the full class of grammars and needs no separate lexer is "packrat parsing", which came out of a recent master's thesis (Bryan Ford's) and has so far been implemented in Haskell and Java. It's very fast, O(n), but it also takes O(n) space, where n is the size of the document being parsed, which isn't so good (an LALR(1) parser only needs O(k) space, where k is the maximum nesting depth). A packrat parser on one of the larger Wikipedia articles (say, 100kb) would take around 0.5-1 second and 4MB of RAM to parse. That's fine for offline generation, but would be impossible to use on wikipedia.org itself, and it'd be ideal if eventually we could have one grammar used for everything, instead of keeping separately specified things in approximate sync.
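For anyone who hasn't run into it, the idea behind packrat parsing is just memoized recursive descent: every (rule, position) result is cached, which gives linear time but also the memo table that accounts for the O(n) space. A minimal sketch with a toy grammar for nested [[...]] links (again, not a real wikitext grammar):

from functools import lru_cache

def make_parser(text):
    @lru_cache(maxsize=None)
    def link(pos):
        # link <- "[[" inner "]]"
        if text.startswith("[[", pos):
            end = inner(pos + 2)
            if end is not None and text.startswith("]]", end):
                return end + 2
        return None

    @lru_cache(maxsize=None)
    def inner(pos):
        # inner <- (link / any character not starting "]]")*
        while pos < len(text):
            if text.startswith("]]", pos):
                return pos
            nested = link(pos)
            pos = nested if nested is not None else pos + 1
        return pos

    return link

text = "[[Image:Foo.jpg|A [[link]] in the caption]]"
parse_link = make_parser(text)
print(parse_link(0) == len(text))  # True: the outer ]] is matched correctly

The lru_cache calls are the "packrat" part: each rule is evaluated at most once per input position, at the cost of keeping all those results around for the whole parse.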
The other possibilities I've found are:
1. Bite the bullet and try to shove wikitext into LALR(1). Not very fun, and it might not even be possible.
2. Write a hand-coded pseudo-recursive-descent parser (though the nature of wikitext means this requires unbounded lookahead to resolve ambiguities).
3. Use a GLR parser generator like Berkeley's Elkhound. This might be doable, but Elkhound is a bit hard to use. Or maybe I just haven't looked hard enough.
4. Use packrat parsing, but change things so articles get parsed on edit instead of on view (edits are only on the order of a few tens of thousands per day), and have views generated from pre-parsed abstract representations, or perhaps even already-generated HTML-with-blanks that just needs link coloring and date-format preferences filled in (roughly sketched below).
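A rough sketch of what I mean by option 4 (all names invented, and the regex here is just a stand-in for the real parse, which would be the expensive packrat step done once per edit):

import re

SKELETON_CACHE = {}

def on_edit(title, wikitext):
    # Expensive step, done once per edit: parse and emit an HTML
    # "skeleton" with placeholders for the per-view bits.
    html = re.sub(r"\[\[(.*?)\]\]", r'<a class="{linkclass:\1}">\1</a>', wikitext)
    SKELETON_CACHE[title] = html

def on_view(title, existing_pages):
    # Cheap step, done per view: fill in link coloring (and, in the real
    # thing, date-format preferences) from the cached skeleton.
    skeleton = SKELETON_CACHE[title]
    def colour(match):
        return "internal" if match.group(1) in existing_pages else "new"
    return re.sub(r"\{linkclass:(.*?)\}", colour, skeleton)

on_edit("Sandbox", "See [[Foo]] and [[Bar]].")
print(on_view("Sandbox", existing_pages={"Foo"}))
# See <a class="internal">Foo</a> and <a class="new">Bar</a>.

The point is just that the per-view work becomes a cheap substitution pass rather than a full parse.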
Anyone have any thoughts in this direction, or suggestions? Is this even worth doing at all? It seems like having wikitext formally specified would be nice: it would allow easy extensions, like the "links inside of image captions" example above, and easy retargeting to any other output format. But doing it for wikitext seems difficult, since most programming languages are specifically designed with clean lexing followed by LALR(1) parsing in mind, and wikitext isn't. That's not meant as a criticism of wikitext, btw: it's clearly supposed to be person-readable first, with machine readability a distant second, but that does make it rather difficult to deal with given the current state of parsing technology.
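On the retargeting point: once you have a real parse tree, emitting a different output format is just another tree walk. A tiny sketch (the node types here are made up):

tree = ("doc", [("text", "See "), ("link", "Main Page"), ("text", ".")])

def to_html(node):
    kind, payload = node
    if kind == "doc":
        return "".join(to_html(child) for child in payload)
    if kind == "link":
        return '<a href="/wiki/%s">%s</a>' % (payload.replace(" ", "_"), payload)
    return payload  # plain text

def to_plaintext(node):
    kind, payload = node
    if kind == "doc":
        return "".join(to_plaintext(child) for child in payload)
    return payload  # links and text both flatten to their label

print(to_html(tree))       # See <a href="/wiki/Main_Page">Main Page</a>.
print(to_plaintext(tree))  # See Main Page.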
-Mark