That was quite amusing: I read the "Welcome to your new list" message before
the wikitech-l message. Anyway, a list just for parser discussion is good.
Here's a bit of ANTLR grammar I wrote to handle basic article structure:
paragraph blocks and "special blocks", where two consecutive blocks of the
same type need an extra linefeed between them. Since I haven't written any
Lex or Yacc before, I'm still wrestling a bit with what are probably fairly
basic problems. In this case, I found the requirement of an extra linefeed
quite challenging to implement without ambiguity problems.
As it is, this does work, but spews out a huge number of warnings and even
an apparently non-fatal "fatal error". I presume some of these problems can
be avoided through semantic and syntactic predicates, if not backtracking
and memoization (no, that's not a typo). Any ANTLR experts here?
Steve
--
grammar paras;
article : pseries? (sseries (EOF| pseries))*;
pseries : para (N+ para)* N*;
sseries : specialblock (N+ specialblock)* N*;
specialblock
: (spaceblock|listblock)+;
spaceblock
: spaceline+;
spaceline
: SPECIALCHAR char* N;
listblock
: (listitem)+;
listitem: (bulletitem | numberitem | indentitem | defitem);
bulletitem
: BULLETCHAR (listitem | (nonlistchar char*)? N);
numberitem
: NUMBERCHAR (listitem | (nonlistchar char*)? N);
indentitem
: INDENTCHAR (listitem | (nonlistchar char*)? N);
defitem
: DEFCHAR (nonindentchar)* (definition | INDENTCHAR? N );
definition
: ':' char+ N;
BULLETCHAR: '*';
NUMBERCHAR: '#';
INDENTCHAR: ':';
DEFCHAR : ';';
para : (nonspecialchar char* N)+;
listchar: BULLETCHAR | NUMBERCHAR | INDENTCHAR | DEFCHAR;
SPECIALCHAR
: ' ';
nonlistchar
: SPECIALCHAR | nonspecialchar;
char : nonlistchar | listchar;
nonindentchar
: nonlistchar | BULLETCHAR | NUMBERCHAR | DEFCHAR;
N : '\r'? '\n' ;
nowiki : NOWIKI;
NOWIKI : '<nowiki>' ( options {greedy=false;} : . )* '</nowiki>';
nonspecialchar
: NONSPECIALCHAR | nowiki;
NONSPECIALCHAR
: ('A'..'Z'| 'a'..'z' | '0'..'9' |
'\'' | '"' | '(' | ')')+;
--
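For anyone who doesn't read ANTLR, here is a minimal Python sketch of the block rule the grammar above is trying to encode: a blank line always ends a block, and so does a change of block type, so only two consecutive blocks of the same type need the extra linefeed. This is an illustration only, not a replacement for the grammar. The names classify and split_blocks are made up for this example, and nowiki handling is ignored.

```python
LIST_OR_SPACE = (" ", "*", "#", ":", ";")  # SPECIALCHAR plus the list markers


def classify(line):
    """Return 'special' for space/list lines, 'para' for everything else."""
    return "special" if line[:1] in LIST_OR_SPACE else "para"


def split_blocks(text):
    """Group lines into (kind, lines) blocks.

    A blank line ends the current block, and so does a change of kind,
    so the blank line is only required between blocks of the same kind.
    """
    blocks = []
    current_kind, current = None, []
    for line in text.splitlines():
        kind = None if line == "" else classify(line)
        if kind != current_kind and current:
            blocks.append((current_kind, current))
            current = []
        current_kind = kind
        if kind is not None:
            current.append(line)
    if current:
        blocks.append((current_kind, current))
    return blocks
```

With the input "a\nb\n\nc\n* x\n* y\n\n# z\n" this yields two paragraph blocks followed by two special blocks. The para-to-special switch needs no blank line, while the two special blocks are only separated because of the blank line between them.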
PS you might notice the above grammar implements two "improvements" to the
;term:definition notation:
1. The ; marker has to be the last marker in the list prefix. Constructs like
##;## are worthless.
2. A trailing : is treated literally.
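
To make the two points above concrete, here is a small Python sketch of how I read the defitem rule. It is an approximation of the grammar, not generated from it. The helper names split_prefix and parse_defitem are hypothetical, and the sketch works on a single line with the trailing linefeed already removed.

```python
def split_prefix(line):
    """Return (markers, rest) for a list line.

    Leading '*', '#' and ':' markers may nest freely, but a ';' ends the
    prefix: anything after it is ordinary text, never another marker.
    """
    i = 0
    while i < len(line) and line[i] in "*#:":
        i += 1
    if i < len(line) and line[i] == ";":
        i += 1
    return line[:i], line[i:]


def parse_defitem(body):
    """Split the text after the ';' marker into (term, definition).

    The first ':' separates term from definition. If nothing follows the
    ':' (or there is no ':' at all), the whole body is the term and the
    trailing ':' is kept literally.
    """
    term, sep, rest = body.partition(":")
    if sep and rest:
        return term, rest
    return body, None
```

For example, split_prefix("##;## x:y") gives ("##;", "## x:y"), so the second ## is just text inside the term, and parse_defitem("term:") gives ("term:", None).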