2010-08-06 10:57, Daniel Kinzler wrote:
Andreas Jonsson wrote:
/me grabs pumping lemma
C'mon, don't be so pessimistic. :) Since I'm writing a context aware parser I don't see why you bring up the pumping lemma. Unless you have formed one for the class of languages that is accepted by the class of parsers represented by the algorithm I'm using.
Sorry, I didn't really look at the algorithm you propose. I just know that one reason previous attempts failed is that wikitext isn't context-free, and thus can't be parsed using an LL(*) grammar as used by ANTLR. Didn't you say you'd use ANTLR? Perhaps a packrat parser would work...
I'm using an ANTLR-generated front end that interacts with a context object that provides the parser with hints. I rely on the token stream being stable, so anything that might produce new tokens after parsing has to be taken care of by a preprocessor (removing comments, inclusions, magic words, etc.).
As an example, to parse the apostrophe jungles, a prescan of the token stream is performed to collect all apostrophe sequences. Then the token stream is rewound, the context computes hints, and the actual parsing is done. So the grammar productions in ANTLR look like this:
    apostrophes:
        (apostrophe)+
        {
            if (X->inlinePrescan) {
                X->onApostrophesPrescan(X, $text);
            }
        }
        ;

    bold:
        {X->takeBold}?=> (begin_bold | end_bold)
        ;

    italic:
        {X->takeItalic}?=> (begin_italic | end_italic)
        ;

    apostrophe:
          {X->inlinePrescan}?=>   A
        | {X->takeApostrophe}?=>  A {X->onConsumedApostrophes(X);}
        ;

    begin_bold:
        {!X->inBold}?=> A A A {X->beginBold(X);}
        ;

    end_bold:
        { X->inBold}?=> A A A {X->endBold(X);}
        ;

    begin_italic:
        {!X->inItalic}?=> A A {X->beginItalic(X);}
        ;

    end_italic:
        { X->inItalic}?=> A A {X->endItalic(X);}
        ;
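The prescan step described above can be illustrated outside of ANTLR. The following is only a minimal sketch of the idea, operating on a plain C string instead of a token stream; the function name `collect_apostrophe_runs` and its parameters are hypothetical and are not part of the parser discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the actual parser): scans text and
 * records the length of every maximal run of apostrophes.  At most
 * max_runs lengths are stored in runs; the return value is the total
 * number of runs found, which may exceed max_runs.
 */
size_t collect_apostrophe_runs(const char *text, size_t *runs, size_t max_runs)
{
    assert(text != NULL && runs != NULL);

    size_t count = 0;
    const char *p = text;

    while (*p != '\0') {
        if (*p != '\'') {
            p++;
            continue;
        }
        /* Measure one maximal run of consecutive apostrophes. */
        size_t len = 0;
        while (*p == '\'') {
            len++;
            p++;
        }
        if (count < max_runs) {
            runs[count] = len;
        }
        count++;
    }
    return count;
}
```

The point of collecting all runs first is that the meaning of any single run depends on the others on the same line, so the decision cannot be made while the run is being consumed.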
Now, computing the hints needed to reproduce the MediaWiki apostrophe parsing is ridiculously complex (see the method below), but it is at least a separate concern. Also, I don't think that there is anything more complex in the syntax, so I feel confident about the general idea of the ANTLR/context combination.
Best regards,
Andreas
    static void endInlinePrescan(MWPARSERCONTEXT *context)
    {
        context->inlinePrescan = false;

        if (context->apostropheSequences != NULL) {
            pANTLR3_VECTOR v = context->apostropheSequences;
            int i;
            int two = 0;
            int three = 0;
            APOSTROPHE_SEQUENCE *victim = NULL;

            for (i = 0; i < v->count; i++) {
                APOSTROPHE_SEQUENCE *s = v->get(v, i);
                switch (s->sequence->len) {
                case 1:
                    break;
                case 2:
                    two++;
                    break;
                case 3:
                case 4:
                    three++;
                    victim = checkPotentialVictim(victim, s);
                    break;
                default:
                    two++;
                    three++;
                    break;
                }
            }

            if (two % 2 != 1 || three % 2 != 1) {
                victim = NULL;
            }

            pANTLR3_VECTOR pi = context->parseInlineInstruction;
            pi->clear(pi);

            int openingFive = -1;
            int italic = 0;
            int bold = 0;

    #define APOSTROPHE pi->add(pi, &AD_APOSTROPHE_CONST, NULL)

    #define ITALIC                                                                     \
        do {                                                                           \
            italic++;                                                                  \
            pi->add(pi, &AD_ITALIC_CONST, NULL);                                       \
            if (openingFive != -1) {                                                   \
                /* Swap bold and italic. */                                            \
                assert(*(APOSTROPHE_DIRECTION*)pi->get(pi, openingFive) == AD_ITALIC && \
                       *(APOSTROPHE_DIRECTION*)pi->get(pi, openingFive + 1) == AD_BOLD); \
                pi->swap(pi, openingFive, openingFive + 1);                            \
                openingFive = -1;                                                      \
            }                                                                          \
        } while (0)

    #define BOLD                                                                       \
        do {                                                                           \
            bold++;                                                                    \
            pi->add(pi, &AD_BOLD_CONST, NULL);                                         \
            openingFive = -1;                                                          \
        } while (0)

            for (i = 0; i < v->count; i++) {
                APOSTROPHE_SEQUENCE *s = v->get(v, i);
                switch (s->sequence->len) {
                case 1:
                    APOSTROPHE;
                    break;
                case 2:
                    ITALIC;
                    break;
                case 3:
                    if (s == victim) {
                        APOSTROPHE;
                        ITALIC;
                    } else {
                        BOLD;
                    }
                    break;
                case 4:
                    if (s == victim) {
                        APOSTROPHE;
                        APOSTROPHE;
                        ITALIC;
                    } else {
                        APOSTROPHE;
                        BOLD;
                    }
                    break;
                default:
                    {
                        int j;
                        for (j = 0; j < s->sequence->len - 5; j++) {
                            APOSTROPHE;
                        }
                        if (italic % 2 == 0 && bold % 2 == 0) {
                            /*
                             * Five (or more) apostrophes opening up new
                             * formattings.  We may need to swap the order
                             * later, so we save the index.
                             */
                            openingFive = pi->count;
                        }
                        pi->add(pi, &AD_ITALIC_CONST, NULL);
                        italic++;
                        pi->add(pi, &AD_BOLD_CONST, NULL);
                        bold++;
                    }
                    break;
                }
            }

            context->currentInlineInstruction = 0;

            APOSTROPHE_DIRECTION ad = *(APOSTROPHE_DIRECTION*)pi->get(pi, 0);

            context->takeApostrophe = ad == AD_APOSTROPHE;
            context->takeItalic     = ad == AD_ITALIC;
            context->takeBold       = ad == AD_BOLD;

            context->apostropheSequences->free(context->apostropheSequences);
            context->apostropheSequences = NULL;
        }
    }
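
The first loop of the method above, together with the check that follows it, amounts to a parity test on the run lengths. The following is a simplified, standalone sketch of just that test, assuming the run lengths are already available as a plain array. The name `apostrophes_unbalanced` is hypothetical, and the choice of victim (`checkPotentialVictim`) is deliberately left out because it is not shown in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (a sketch, not the parser's own code): counts
 * italic-like and bold-like apostrophe runs the same way the first
 * loop of endInlinePrescan does, and reports whether both counts are
 * odd.  Only in that case does the method keep its victim candidate.
 */
bool apostrophes_unbalanced(const size_t *runs, size_t n)
{
    assert(runs != NULL || n == 0);

    int two = 0;
    int three = 0;

    for (size_t i = 0; i < n; i++) {
        switch (runs[i]) {
        case 0:
        case 1:
            break;              /* a lone apostrophe is plain text */
        case 2:
            two++;              /* italic marker */
            break;
        case 3:
        case 4:
            three++;            /* bold marker (4 = apostrophe + bold) */
            break;
        default:
            two++;              /* 5 or more: italic and bold together */
            three++;
            break;
        }
    }
    return two % 2 == 1 && three % 2 == 1;
}
```

Note that a true result only says that a victim is wanted. In the method above a candidate exists only among runs of length 3 or 4, so a line consisting of a single run of five apostrophes is unbalanced by this test yet still ends up with no victim.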