I like the WikiRover, but one thing I think it needs is a formal parser for wikitext. I haven't looked at MediaWiki 1.3, but the last time I looked at the parser (1.1, maybe?) it wasn't actually a parser, but a bunch of regular expressions applied to the flat wiki text, with some hacks like replacing math sections with a unique placeholder string so they don't get clobbered. All of that makes repurposing it for other things difficult. As a side note, it also seems to make extending it difficult: it appears to be the reason (unless I'm missing something else) for limitations like "you can't have links inside of image captions", since regular expressions are less expressive than context-free grammars and so can't distinguish the ]] closing an internal link from the ]] closing the image.
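To illustrate the kind of problem I mean (this is a toy example, not MediaWiki's actual code): a single regex for image syntax has no way to know which ]] is the matching one, so a nested link truncates the match.

import re

# Naive regex for "[[Image:...]]": the non-greedy match stops at the
# FIRST "]]", which belongs to the nested link, not the image.
text = "[[Image:Foo.jpg|thumb|A [[link]] in the caption]]"
image_re = re.compile(r"\[\[Image:(.*?)\]\]")
print(image_re.search(text).group(1))
# -> 'Foo.jpg|thumb|A [[link'   (the caption is cut off)

A real parser tracks nesting, so this case is trivial for it.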
I've been thinking of doing it for a while, but my main hang-up, apart from lack of time, is the lack of a good parser generator for the full class of context-free grammars. Most require LALR(1) grammars, and maintaining the wikitext specification in that form, not to mention getting it there in the first place, would be a nightmare, since wikitext isn't designed with it in mind the way many programming languages are (it's hard to even scan wikitext with a lexer and unambiguously find terminals). One approach that both accepts the full class of grammars and needs no separate lexer is "packrat parsing", which came out of a recent master's thesis (Bryan Ford's) and has so far been implemented in Haskell and Java. It's very fast, O(n), but it also takes O(n) space, where n is the size of the document being parsed, which isn't so good (an LALR(1) parser only needs O(k) space, where k is the maximum nesting depth). A packrat parser on one of the larger Wikipedia articles (say, 100kb) would take around 0.5-1 second and 4MB of RAM to parse. That's fine for offline generation, but would be impossible to use on wikipedia.org itself, and it'd be ideal if eventually we could have one grammar used for everything, instead of keeping separately specified things in approximate sync.
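For anyone who hasn't run into it, the idea behind packrat parsing is just memoized recursive descent: every (rule, position) result is cached, which gives linear time but also the memo table that accounts for the O(n) space. A minimal sketch with a toy grammar for nested [[...]] links (again, not a real wikitext grammar):

from functools import lru_cache

def make_parser(text):
    @lru_cache(maxsize=None)
    def link(pos):
        # link <- "[[" inner "]]"
        if text.startswith("[[", pos):
            end = inner(pos + 2)
            if end is not None and text.startswith("]]", end):
                return end + 2
        return None

    @lru_cache(maxsize=None)
    def inner(pos):
        # inner <- (link / any character not starting "]]")*
        while pos < len(text):
            if text.startswith("]]", pos):
                return pos
            nested = link(pos)
            pos = nested if nested is not None else pos + 1
        return pos

    return link

text = "[[Image:Foo.jpg|A [[link]] in the caption]]"
parse_link = make_parser(text)
print(parse_link(0) == len(text))  # True: the outer ]] is matched correctly

The lru_cache calls are the "packrat" part: each rule is evaluated at most once per input position, at the cost of keeping all those results around for the whole parse.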
The other possibilities I've found are:
1. Bite the bullet and try to shove wikitext into LALR(1). Not very fun, and it might not even be possible.
2. Write a hand-coded pseudo-recursive-descent parser (though the nature of wikitext means this requires unbounded lookahead to resolve ambiguities).
3. Use a GLR parser generator like Berkeley's Elkhound. This might be doable, but Elkhound is a bit hard to use. Or maybe I just haven't looked hard enough.
4. Use packrat parsing, but change things so articles get parsed on edit instead of on view (edits are only on the order of a few tens of thousands per day), and have views generated from pre-parsed abstract representations, or perhaps even already-generated HTML-with-blanks that just needs link coloring and date-format preferences filled in (roughly sketched below).
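A rough sketch of what I mean by option 4 (all names invented, and the regex here is just a stand-in for the real parse, which would be the expensive packrat step done once per edit):

import re

SKELETON_CACHE = {}

def on_edit(title, wikitext):
    # Expensive step, done once per edit: parse and emit an HTML
    # "skeleton" with placeholders for the per-view bits.
    html = re.sub(r"\[\[(.*?)\]\]", r'<a class="{linkclass:\1}">\1</a>', wikitext)
    SKELETON_CACHE[title] = html

def on_view(title, existing_pages):
    # Cheap step, done per view: fill in link coloring (and, in the real
    # thing, date-format preferences) from the cached skeleton.
    skeleton = SKELETON_CACHE[title]
    def colour(match):
        return "internal" if match.group(1) in existing_pages else "new"
    return re.sub(r"\{linkclass:(.*?)\}", colour, skeleton)

on_edit("Sandbox", "See [[Foo]] and [[Bar]].")
print(on_view("Sandbox", existing_pages={"Foo"}))
# See <a class="internal">Foo</a> and <a class="new">Bar</a>.

The point is just that the per-view work becomes a cheap substitution pass rather than a full parse.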
Anyone have any thoughts in this direction, or suggestions? Is this even worth doing at all? It seems like having wikitext formally specified would be nice: it would allow easy extensions, like the "links inside of image captions" example above, and easy retargeting to any other output format. But doing it for wikitext seems difficult, since most programming languages are specifically designed with clean lexing followed by LALR(1) parsing in mind, and wikitext isn't. That's not meant as a criticism of wikitext, btw: it's clearly supposed to be person-readable first, with machine readability a distant second, but that does make it rather difficult to deal with given the current state of parsing technology.
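On the retargeting point: once you have a real parse tree, emitting a different output format is just another tree walk. A tiny sketch (the node types here are made up):

tree = ("doc", [("text", "See "), ("link", "Main Page"), ("text", ".")])

def to_html(node):
    kind, payload = node
    if kind == "doc":
        return "".join(to_html(child) for child in payload)
    if kind == "link":
        return '<a href="/wiki/%s">%s</a>' % (payload.replace(" ", "_"), payload)
    return payload  # plain text

def to_plaintext(node):
    kind, payload = node
    if kind == "doc":
        return "".join(to_plaintext(child) for child in payload)
    return payload  # links and text both flatten to their label

print(to_html(tree))       # See <a href="/wiki/Main_Page">Main Page</a>.
print(to_plaintext(tree))  # See Main Page.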
-Mark