I'm about to head off for a week and a half, so here's a quick progress stop. My ANTLR grammar so far is here:
http://www.mediawiki.org/wiki/User:Stevage/ANTLR
It covers many features, but most aren't really complete.
Supports:
* Internal links
* External links (limited range of characters allowed)
* Images (all options)
* Headings (limits on ='s in the text)
* Nowiki, pre
* French punctuation (foo ? -> foo&nbsp;?)
* HTML entities (&nbsp; is recognised, &foo; is converted to literals)
* Dangerous HTML, < -> &lt; etc
* Bold, italics (supports the basic rules, not the single-character stuff)
* Paragraphs
* Space-indented blocks
* Lists (intentionally doesn't support nested ; lists, does support ;foo:blah)
* ISBN, RFC, PMID (fully, I think)
Does not support:
* Categories
* Tables
* Inline HTML (<b>, <div> etc)
* __TOC__ etc
* HTML comments
Other limitations:
* Very reduced ranges of characters for many things: for instance, it doesn't know that é is a letter rather than punctuation
* Case sensitivity in some places (<NOWIKI> is not recognised)
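For illustration only, here is roughly what the two missing checks look like in plain Java. This is a sketch, not part of the grammar: the class and method names are made up, and in the grammar itself the fix would more likely be wider character ranges in the lexer rules.

```java
// Hypothetical helper, not taken from the grammar.
class CharClasses {
    // Treats any Unicode letter or digit as part of a word,
    // so accented letters such as U+00E9 are not mistaken for punctuation.
    static boolean isWordChar(int codePoint) {
        return Character.isLetter(codePoint) || Character.isDigit(codePoint);
    }

    // Case-insensitive check that `tag` occurs in `text` at `offset`,
    // so <NOWIKI>, <nowiki> and <NoWiki> are all accepted.
    static boolean startsWithTag(String text, int offset, String tag) {
        return text.regionMatches(true, offset, tag, 0, tag.length());
    }

    public static void main(String[] args) {
        System.out.println(isWordChar('\u00e9'));
        System.out.println(isWordChar('?'));
        System.out.println(startsWithTag("<NOWIKI>foo</NOWIKI>", 0, "<nowiki>"));
    }
}
```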
At the moment, it simply builds an AST, but converting from that AST to HTML should be pretty trivial. I have in mind some simple tree-cleaning steps first, like concatenating consecutive P blocks into one (I'm using BR to indicate a gap of two or more new lines), concatenating consecutive OL etc.
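A minimal sketch of that kind of merging pass, assuming a hypothetical Node class rather than ANTLR's own tree types. The node type names (P, OL, BR, TEXT) follow the description above; everything else is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AST node; not ANTLR's tree class.
class Node {
    final String type;   // e.g. "ROOT", "P", "OL", "BR", "TEXT"
    final String text;   // payload for leaves, may be null
    final List<Node> children = new ArrayList<>();

    Node(String type, String text) { this.type = type; this.text = text; }

    Node add(Node child) { children.add(child); return this; }

    // Compact rendering such as P(TEXT:a TEXT:b), handy for checking results.
    @Override public String toString() {
        if (children.isEmpty()) return text == null ? type : type + ":" + text;
        StringBuilder sb = new StringBuilder(type).append('(');
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(children.get(i));
        }
        return sb.append(')').toString();
    }
}

class TreeCleanup {
    // Only these block types are concatenated when they are adjacent.
    static boolean mergeable(String type) {
        return type.equals("P") || type.equals("OL");
    }

    // Merges adjacent siblings of the same mergeable type, then recurses.
    static void mergeSiblings(Node parent) {
        List<Node> merged = new ArrayList<>();
        for (Node child : parent.children) {
            Node last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && mergeable(child.type) && last.type.equals(child.type)) {
                last.children.addAll(child.children);  // fold into previous block
            } else {
                merged.add(child);
            }
        }
        parent.children.clear();
        parent.children.addAll(merged);
        for (Node child : parent.children) mergeSiblings(child);
    }

    public static void main(String[] args) {
        Node root = new Node("ROOT", null)
            .add(new Node("P", null).add(new Node("TEXT", "a")))
            .add(new Node("P", null).add(new Node("TEXT", "b")))
            .add(new Node("BR", null))
            .add(new Node("P", null).add(new Node("TEXT", "c")));
        mergeSiblings(root);
        System.out.println(root);
    }
}
```

The BR node sits between the second and third P, so only the first two are concatenated; the paragraph break survives.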
I offer this up just for curiosity's sake - no one should try and hack on it ;)
[hrm, on closer inspection, that's not the latest version of that file. oh well.]
Steve
On 11/12/2007, Steve Bennett stevagewp@gmail.com wrote:
I'm about to head off for a week and a half, so here's a quick progress stop. My ANTLR grammar so far is here: http://www.mediawiki.org/wiki/User:Stevage/ANTLR It does many features, but most aren't really complete. I offer this up just for curiosity's sake - no one should try and hack on it ;) [hrm, on closer inspection, that's not the latest version of that file. oh well.]
You should link the above from the ANTLR page and include this email at the top of it.
- d.
On 12/11/07, David Gerard dgerard@gmail.com wrote:
You should link the above from the ANTLR page and include this email at the top of it.
It's a wiki isn't it? Feel free. :)
This is still very much work in progress and hasn't been tidied up at all. I would be interested to hear whether anyone finds this ANTLR grammar readable and meaningful at all. If the grammar is not expressive and readable, there's not much point having it.
I'm especially troubled by the syntactic predicates which seem to be required to suppress warnings by the ANTLR compiler. These are the ones that look like:
rule: (option1) => option1 | (option2) => option2;
Most of the time this behaves exactly the same as:
rule: option1 | option2;
but if option1 and option2 can match the same input, then ANTLR will generate a warning if the syntactic predicates aren't there. However, with the syntactic predicates it ends up parsing the text twice (I think) - once to check whether the predicate will succeed, then once for real. It's a pretty annoying trade-off: readability and performance vs no warnings and certainty of execution path.
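To make the "parses twice" cost concrete, here is a toy hand-written parser that mimics the pattern: try an alternative speculatively, rewind, then run it again for real. It is only a sketch of the general backtracking idea, not what ANTLR actually generates, and the two link rules are hypothetical stand-ins for option1 and option2.

```java
import java.util.function.BooleanSupplier;

class SpeculativeParser {
    private final String input;
    private int pos = 0;
    int charsExamined = 0;   // how many characters have been checked so far

    SpeculativeParser(String input) { this.input = input; }

    private boolean match(char expected) {
        charsExamined++;
        if (pos < input.length() && input.charAt(pos) == expected) {
            pos++;
            return true;
        }
        return false;
    }

    private void letters() {
        while (pos < input.length() && Character.isLetter(input.charAt(pos))) {
            charsExamined++;
            pos++;
        }
    }

    // option1 (hypothetical): '[' '[' letters ']' ']'
    private boolean internalLink() {
        if (!match('[') || !match('[')) return false;
        letters();
        return match(']') && match(']');
    }

    // option2 (hypothetical): '[' letters ']'
    private boolean externalLink() {
        if (!match('[')) return false;
        letters();
        return match(']');
    }

    // The "(alt) =>" part: run the alternative, remember the result, rewind.
    private boolean speculate(BooleanSupplier alt) {
        int start = pos;
        boolean ok = alt.getAsBoolean();
        pos = start;
        return ok;
    }

    // rule: (option1) => option1 | (option2) => option2;
    String link() {
        if (speculate(this::internalLink)) { internalLink(); return "internal"; }
        if (speculate(this::externalLink)) { externalLink(); return "external"; }
        return "none";
    }

    public static void main(String[] args) {
        SpeculativeParser p = new SpeculativeParser("[[foo]]");
        System.out.println(p.link() + " after " + p.charsExamined + " character checks");
    }
}
```

For the input [[foo]] this counts 14 character checks: 7 for the speculative pass and the same 7 again for the real one. Both alternatives start with '[', which is the overlap that makes a single character of lookahead insufficient.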
I'm also a bit concerned about the eventual performance of this thing. Already parsing a page of wikitext seems to take a very, very long time (e.g. 10 seconds), but I don't know how much of that is caused by the environment (the JVM), the debugger etc. And of course my grammar is pretty inefficient in many ways.
Steve
wikitext-l@lists.wikimedia.org