That was quite amusing: I read the "Welcome to your new list" message before the wikitech-l message. Anyway, a list just for parser discussion is good.
Here's a bit of ANTLR grammar I wrote to handle basic article structure: paragraph blocks and "special blocks", where two consecutive blocks of the same type need an extra linefeed. Since I haven't written any Lex or Yacc before, I'm still wrestling a bit with what are probably fairly basic problems. In this case, I found the requirement of an extra linefeed quite challenging to implement without ambiguity problems.
As it is, this does work, but it spews out a huge number of warnings and even an apparently non-fatal "fatal error". I presume some of these problems can be avoided through semantic and syntactic predicates, if not backtracking and memoization (no, that's not a typo). Any ANTLR experts here?
Steve
--
grammar paras;

// Top-level structure: series of paragraphs and series of special blocks.
article : pseries? (sseries (EOF | pseries))*;
pseries : para (N+ para)* N*;
sseries : specialblock (N+ specialblock)* N*;

specialblock : (spaceblock | listblock)+;

// Lines starting with a space.
spaceblock : spaceline+;
spaceline : SPECIALCHAR char* N;

// Lines starting with a list character.
listblock : (listitem)+;
listitem : (bulletitem | numberitem | indentitem | defitem);

bulletitem : BULLETCHAR (listitem | (nonlistchar char*)? N);
numberitem : NUMBERCHAR (listitem | (nonlistchar char*)? N);
indentitem : INDENTCHAR (listitem | (nonlistchar char*)? N);
defitem : DEFCHAR (nonindentchar)* (definition | INDENTCHAR? N);
definition : ':' char+ N;

BULLETCHAR : '*';
NUMBERCHAR : '#';
INDENTCHAR : ':';
DEFCHAR : ';';

para : (nonspecialchar char* N)+;

listchar : BULLETCHAR | NUMBERCHAR | INDENTCHAR | DEFCHAR;

SPECIALCHAR : ' ';
nonlistchar : SPECIALCHAR | nonspecialchar;
char : nonlistchar | listchar;
nonindentchar : nonlistchar | BULLETCHAR | NUMBERCHAR | DEFCHAR;
N : '\r'? '\n';

nowiki : NOWIKI;
NOWIKI : '<nowiki>' ( options {greedy=false;} : . )* '</nowiki>';

nonspecialchar : NONSPECIALCHAR | nowiki;

NONSPECIALCHAR : ('A'..'Z' | 'a'..'z' | '0'..'9' | '\'' | '"' | '(' | ')')+;
--
PS you might notice the above grammar implements two "improvements" to the ;definition:term notation:
1. The ;definition has to be the last item in the list. Constructs like ##;## are worthless.
2. A trailing : is treated literally.
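For anyone who finds the grammar hard to read, here is a rough procedural sketch, in Python rather than ANTLR, of the block structure it seems to be aiming at. It assumes that a line starting with a space or one of `*#:;` belongs to a "special" block, that any other non-empty line belongs to a paragraph, and that a blank line is what separates two blocks of the same kind. The names `classify` and `split_blocks` are made up for illustration, and the sketch ignores `<nowiki>` and the internal structure of list items.

```python
LIST_CHARS = "*#:;"  # bullet, number, indent and definition markers


def classify(line):
    """Return 'blank', 'special', or 'para' for one line of wikitext."""
    if line == "":
        return "blank"
    if line[0] == " " or line[0] in LIST_CHARS:
        return "special"
    return "para"


def split_blocks(text):
    """Group lines into (kind, lines) blocks.

    Adjacent lines of the same kind merge into one block. A blank line
    ends the current block, so two blocks of the same kind can only be
    told apart when a blank line sits between them.
    """
    blocks = []
    current_kind = None
    for line in text.splitlines():
        kind = classify(line)
        if kind == "blank":
            current_kind = None  # force the next line to start a new block
            continue
        if kind == current_kind:
            blocks[-1][1].append(line)
        else:
            blocks.append((kind, [line]))
            current_kind = kind
    return blocks
```

A paragraph followed directly by a list needs no blank line because the kinds differ, whereas two paragraphs (or two lists) run together unless a blank line separates them. That is the "extra linefeed" requirement that appears to make the grammar awkward to write without ambiguity.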
On 17/11/2007, Steve Bennett stevagewp@gmail.com wrote:
Here's a bit of ANTLR grammar I wrote to handle basic article structure:
I see from [[:en:ANTLR]] that ANTLR compiles to C++, Java, Python and C# - not PHP. How feasible will it be to get PHP from this?
- d.
On 11/17/07, David Gerard dgerard@gmail.com wrote:
On 17/11/2007, Steve Bennett stevagewp@gmail.com wrote:
Here's a bit of ANTLR grammar I wrote to handle basic article structure:
I see from [[:en:ANTLR]] that ANTLR compiles to C++, Java, Python and C# - not PHP. How feasible will it be to get PHP from this?
Yeah, I addressed that in my second email. There are four roads I can think of:
1) Help implement the PHP target.
2) Compile to one of the other targets, then translate (possibly using an automated tool).
3) Translate the original grammar to Lex or whatever.
4) Compile to one of the other targets (eg, C) then link to that from the PHP code. Apparently that makes it harder for 3rd parties to run, but I can't really speak to why.
Option 3 is not so bad. We need a formal grammar. A formal grammar written in ANTLR is an incredibly useful thing, and if it's slightly inconvenient for our immediate parser-writing purposes, so be it. ANTLR is so expressive that whatever *other* mechanism we could be writing it in (eg, EBNF with English descriptions for semantically disambiguating ambiguous syntax), ANTLR syntax would *still* be a better way of expressing it, even if we don't use a parser directly generated by ANTLR.
Steve
On 17/11/2007, Steve Bennett stevagewp@gmail.com wrote:
1) Help implement the PHP target.
2) Compile to one of the other targets, then translate (possibly using an automated tool)
3) Translate the original grammar to Lex or whatever.
Mmm. Whichever of these is used, you'd need a note in parser.php that "DO NOT PATCH DIRECTLY, THIS IS GENERATED CODE" and that parser changes should be made to the ANTLR or lex grammar.
4) Compile to one of the other targets (eg, C) then link to that from the PHP code. Apparently that makes it harder for 3rd parties to run, but I can't really speak to why.
As I understand it, the issue is hosted copies of MediaWiki where the user can only use PHP, not compile anything or run arbitrary binaries or touch httpd.conf.
I expect where a user *does* have compiler access, a C implementation would be the parser implementation of choice.
- d.
On 11/18/07, David Gerard dgerard@gmail.com wrote:
On 17/11/2007, Steve Bennett stevagewp@gmail.com wrote:
1) Help implement the PHP target.
On second thoughts, definitely option 1.
http://www.antlr.org/wiki/display/ANTLR3/How+to+build+an+ANTLR+code+generati...
Supposedly "not that hard".
Mmm. Whichever of these is used, you'd need a note in parser.php that "DO NOT PATCH DIRECTLY, THIS IS GENERATED CODE" and that parser changes should be made to the ANTLR or lex grammar.
Well yeah. But it wouldn't be parser.php, it would be its own module.
As I understand it, the issue is hosted copies of MediaWiki where the user can only use PHP, not compile anything or run arbitrary binaries or touch httpd.conf.
Ah yes.
I expect where a user *does* have compiler access, a C implementation would be the parser implementation of choice.
I don't think it's easily possible to write both implementations in the one grammar, but there might be ways of doing it.
Steve
On 18/11/2007, Steve Bennett stevagewp@gmail.com wrote:
On 11/18/07, David Gerard dgerard@gmail.com wrote:
I expect where a user *does* have compiler access, a C implementation would be the parser implementation of choice.
I don't think it's easily possible to write both implementations in the one grammar, but there might be ways of doing it.
It wouldn't be too hard [*] to auto-convert from the C auto-generated by ANTLR to a PHP file. Two implementations of one grammar. ISTR seeing something about a converter; if not, doing it via the mid-step of Java (icky, I know, but plausible) would work.
[*] - This is not an offer. Sorry. :-)
Yours,
On 11/18/07, James Forrester jdforrester@gmail.com wrote:
It wouldn't be too hard [*] to auto-convert from the C auto-generated by ANTLR to a PHP file. Two implementations of one grammar. ISTR seeing something about a converter; if not, doing it via the mid-step of Java (icky, I know, but plausible) would work.
[*] - This is not an offer. Sorry. :-)
Heh. Well bizarrely enough we could actually make the target Java then auto-translate from Java to C and PHP. The advantage here is that the guy who wrote ANTLR seems to specialise in Java->X translators for some reason.
Anyway, in the meantime I'm assuming that a functioning ANTLR grammar, for any target, is a huge step in the right direction.
Still stuck on some intricacies of ANTLR. These three constructs seem to be different, but I don't know how or why exactly:
1) hello: 'hello';
2) hello: 'h' 'e' 'l' 'l' 'o';
3) hello: HELLO; HELLO: 'hello';
Steve
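One way to picture the difference between the three constructs above, as far as ANTLR 3 combined grammars go: a quoted literal inside a parser rule becomes its own implicit token type, so construct 1 asks the lexer for a single 'hello' token, construct 2 asks for five single-character tokens that the parser then matches in sequence, and construct 3 is like construct 1 except that the token has an explicit name. The Python below is only a toy illustration of that token-level difference. `tokenize` is a made-up helper using a simple longest-literal-first rule, which is a simplification and not how ANTLR's generated lexer actually chooses tokens.

```python
def tokenize(text, literals):
    """Split text into tokens drawn from `literals`, trying longer literals first."""
    ordered = sorted(literals, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        for lit in ordered:
            if text.startswith(lit, pos):
                tokens.append(lit)
                pos += len(lit)
                break
        else:
            # No literal matches here: the toy lexer gives up.
            raise ValueError(f"no token matches at position {pos}")
    return tokens


# Constructs 1 and 3: the whole word is one token type.
whole_word = tokenize("hello", {"hello"})

# Construct 2: each quoted character is its own token type.
per_char = tokenize("hello", {"h", "e", "l", "o"})
```

Here `whole_word` is a one-element list and `per_char` has five elements. If that picture is right, it matters for the paras grammar because any character that a literal or lexer rule claims as part of a larger token is no longer available to the parser as an individual character.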
wikitext-l@lists.wikimedia.org