I'm about to head off for a week and a half, so here's a quick progress update. My ANTLR grammar so far is here:
http://www.mediawiki.org/wiki/User:Stevage/ANTLR
It handles many features, but most aren't really complete.
Supports:
* Internal links
* External links (limited range of characters allowed)
* Images (all options)
* Headings (limits on ='s in the text)
* Nowiki, pre
* French punctuation (foo ? -> foo&nbsp;?)
* HTML entities (&nbsp; is recognised, &foo; is converted to literals)
* Dangerous HTML, < -> &lt; etc
* Bold, italics (supports the basic rules, not the single-character stuff)
* Paragraphs
* Space-indented blocks
* Lists (intentionally doesn't support nested ; lists, does support ;foo:blah)
* ISBN, RFC, PMID (fully, I think)
Does not support:
* Categories
* Tables
* Inline HTML (<b>, <div> etc)
* __TOC__ etc
* HTML comments
Other limitations:
* Very reduced ranges of characters for many things; for instance, it doesn't know that é is a letter rather than punctuation
* Case sensitivity in some places (<NOWIKI> is not recognised)
At the moment, it simply builds an AST, but converting from that AST to HTML should be pretty trivial. I have in mind some simple tree-cleaning steps first, like concatenating consecutive P blocks into one (I'm using BR to indicate a gap of two or more newlines), concatenating consecutive OL blocks, etc.
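To give a rough idea of what such a tree-cleaning pass might look like, here is a minimal sketch in Python. It is not taken from the grammar or its generated code: the Node class, the MERGEABLE set and the merge_adjacent function are all hypothetical names made up for illustration, and the real AST produced by ANTLR will be shaped differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical AST node; the real ANTLR tree type will differ."""
    kind: str                                   # e.g. "P", "BR", "OL", "TEXT"
    children: List["Node"] = field(default_factory=list)
    text: str = ""


# Assumption: these are the node kinds whose consecutive siblings get merged.
MERGEABLE = {"P", "OL"}


def merge_adjacent(nodes: List[Node]) -> List[Node]:
    """Return a cleaned copy of `nodes` with consecutive mergeable siblings
    of the same kind concatenated into one node."""
    merged: List[Node] = []
    for node in nodes:
        if merged and node.kind in MERGEABLE and merged[-1].kind == node.kind:
            # Same kind as the previous sibling: append its children there.
            merged[-1].children.extend(node.children)
        else:
            # Copy the node so the input tree is left untouched.
            merged.append(Node(node.kind, list(node.children), node.text))
    # Clean the children only after merging, so nodes that became
    # neighbours through a merge are handled as well.
    for node in merged:
        node.children = merge_adjacent(node.children)
    return merged


tree = [
    Node("P", [Node("TEXT", text="line one")]),
    Node("P", [Node("TEXT", text="line two")]),
    Node("BR"),
    Node("P", [Node("TEXT", text="next paragraph")]),
]
cleaned = merge_adjacent(tree)
print([n.kind for n in cleaned])  # ['P', 'BR', 'P']
```

The first two P blocks collapse into one, while the BR keeps the third P separate, which matches the idea of BR marking a gap of two or more newlines.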
I offer this up just for curiosity's sake - no one should try and hack on it ;)
[hrm, on closer inspection, that's not the latest version of that file. oh well.]
Steve