Re: [Wikitext-l] MediaWiki parser implementation

6 Aug 2010


2010-08-06 10:57, Daniel Kinzler wrote:
...
Andreas Jonsson wrote:
...
...
/me grabs pumping lemma
C'mon, don't be so pessimistic. :)  Since I'm writing a context aware
parser I don't see why you bring up the pumping lemma.  Unless you have
formed one for the class of languages that is accepted by the class of
parsers represented by the algorithm I'm using.
Sorry, I didn't really look at the algorithm you propose. I just know that one
reason previous attempts failed is that wikitext isn't context-free, and thus
can't be parsed using an LL(*) grammar as used by Antlr. Didn't you say you'd
use Antlr? Perhaps a packrat parser would work...

I'm using an antlr-generated front end that interacts with a context
object that provides the parser with hints.  I rely on the token
stream being stable, so anything that might produce new tokens after
parsing has to be taken care of by a preprocessor (removing comments,
inclusions, magic words, etc.).

As an example, to parse the apostrophe jungles, a prescan of the
token stream is performed to collect all apostrophe sequences.  Then
the token stream is rewound, the context computes hints, and the
actual parsing is done.  So the grammar productions in antlr look like this:

apostrophes: (apostrophe)+ { if(X->inlinePrescan) { X->onApostrophesPrescan(X, $text); } };
bold:   {X->takeBold}?=> (begin_bold   | end_bold);
italic: {X->takeItalic}?=> (begin_italic | end_italic);
apostrophe: {X->inlinePrescan}?=> A
     |       {X->takeApostrophe}?=> A {X->onConsumedApostrophes(X);} ;
begin_bold:   {!X->inBold}?=>   A A A {X->beginBold(X);}  ;
end_bold:     { X->inBold}?=>   A A A {X->endBold(X);}    ;
begin_italic: {!X->inItalic}?=> A A   {X->beginItalic(X);};
end_italic:   { X->inItalic}?=> A A   {X->endItalic(X);}  ;
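
The prescan idea does not depend on antlr. As a rough illustration only (this is not code from the parser discussed here; `collect_apostrophe_runs` and its parameters are made-up names), the following C sketch scans a plain string instead of a token stream and records the length of every apostrophe run, which is the information the context needs before the real parse starts:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: walk `text` once and store the length of each
 * run of consecutive apostrophes in `lens`.  At most `max` lengths are
 * stored; the return value is the total number of runs found, so the
 * caller can detect truncation.
 */
static size_t
collect_apostrophe_runs(const char *text, int *lens, size_t max)
{
    size_t n = 0;
    const char *p = text;

    assert(text != NULL);
    assert(lens != NULL || max == 0);

    while (*p != '\0') {
        if (*p == '\'') {
            int len = 0;
            /* Consume the whole run before recording it. */
            while (*p == '\'') {
                len++;
                p++;
            }
            if (n < max) {
                lens[n] = len;
            }
            n++;
        } else {
            p++;
        }
    }
    return n;
}
```

For the input `''a'''b'` this stores the lengths 2, 3 and 1. A second pass over the same input can then consult decisions computed from those lengths, which is the role the predicates on `X` play in the grammar.
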
Now, computing hints to reproduce the MediaWiki apostrophe parsing is
ridiculously complex (see the method below), but it is at least a
separate concern.  Also, I don't think that there is anything more
complex in the syntax, so I feel confident about the general idea of
the antlr/context combination.

Best regards,
Andreas

static void
endInlinePrescan(MWPARSERCONTEXT *context)
{
    context->inlinePrescan = false;

    if (context->apostropheSequences != NULL) {
        pANTLR3_VECTOR v = context->apostropheSequences;
        int i;
        int two = 0;
        int three = 0;
        APOSTROPHE_SEQUENCE *victim = NULL;

        /* First pass: count italic (two) and bold (three) candidates. */
        for (i = 0; i < v->count; i++) {
            APOSTROPHE_SEQUENCE *s = v->get(v, i);
            switch (s->sequence->len) {
            case 1:
                break;
            case 2:
                two++;
                break;
            case 3:
            case 4:
                three++;
                /* A bold that may have to be read as apostrophe + italic. */
                victim = checkPotentialVictim(victim, s);
                break;
            default:
                two++;
                three++;
                break;
            }
        }

        /* A victim is only used when both counts are odd. */
        if (two % 2 != 1 || three % 2 != 1) {
            victim = NULL;
        }

        pANTLR3_VECTOR pi = context->parseInlineInstruction;
        pi->clear(pi);

        int openingFive = -1;
        int italic = 0;
        int bold = 0;

#define APOSTROPHE pi->add(pi, &AD_APOSTROPHE_CONST, NULL)

#define ITALIC                                                          \
    do {                                                                \
        italic++;                                                       \
        pi->add(pi, &AD_ITALIC_CONST, NULL);                            \
        if (openingFive != -1) {                                        \
            /*                                                          \
             * Swap bold and italic.                                    \
             */                                                         \
            assert(*(APOSTROPHE_DIRECTION*)pi->get(pi, openingFive) == AD_ITALIC && \
                   *(APOSTROPHE_DIRECTION*)pi->get(pi, openingFive + 1) == AD_BOLD); \
            pi->swap(pi, openingFive, openingFive + 1);                 \
            openingFive = -1;                                           \
        }                                                               \
    } while (0)

#define BOLD                                    \
    do {                                        \
        bold++;                                 \
        pi->add(pi, &AD_BOLD_CONST, NULL);      \
        openingFive = -1;                       \
    } while (0)

        /* Second pass: emit one instruction per token to be consumed. */
        for (i = 0; i < v->count; i++) {
            APOSTROPHE_SEQUENCE *s = v->get(v, i);
            switch (s->sequence->len) {
            case 1:
                APOSTROPHE;
                break;
            case 2:
                ITALIC;
                break;
            case 3:
                if (s == victim) {
                    APOSTROPHE; ITALIC;
                } else {
                    BOLD;
                }
                break;
            case 4:
                if (s == victim) {
                    APOSTROPHE; APOSTROPHE; ITALIC;
                } else {
                    APOSTROPHE; BOLD;
                }
                break;
            default:
                {
                    int j;
                    /* Everything beyond five is plain apostrophes. */
                    for (j = 0; j < s->sequence->len - 5; j++) {
                        APOSTROPHE;
                    }
                    if (italic % 2 == 0 && bold % 2 == 0) {
                        /*
                         * Five (or more) apostrophes opening up new
                         * formattings.  We may need to swap the order
                         * later, so we save the index.
                         */
                        openingFive = pi->count;
                    }
                    pi->add(pi, &AD_ITALIC_CONST, NULL);
                    italic++;
                    pi->add(pi, &AD_BOLD_CONST, NULL);
                    bold++;
                }
                break;
            }
        }

        /* Prime the predicates used by the grammar with the first instruction. */
        context->currentInlineInstruction = 0;
        APOSTROPHE_DIRECTION ad = *(APOSTROPHE_DIRECTION*)pi->get(pi, 0);
        context->takeApostrophe = ad == AD_APOSTROPHE;
        context->takeItalic = ad == AD_ITALIC;
        context->takeBold = ad == AD_BOLD;

        context->apostropheSequences->free(context->apostropheSequences);
        context->apostropheSequences = NULL;
    }
}
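
The first loop of the method boils down to a parity check on two counters. A minimal standalone sketch of just that check, with plain ints in place of the antlr vector (the name `italic_and_bold_both_odd` is invented for this illustration, and run lengths are assumed to be at least 1):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: count apostrophe runs the same way the first
 * loop does and report whether both counters end up odd.  Only in that
 * case is a three- or four-apostrophe run reinterpreted; whether such a
 * run actually exists is a separate question this sketch does not answer.
 */
static bool
italic_and_bold_both_odd(const int *lens, size_t n)
{
    int two = 0;
    int three = 0;
    size_t i;

    assert(lens != NULL || n == 0);

    for (i = 0; i < n; i++) {
        switch (lens[i]) {
        case 1:             /* lone apostrophe: no formatting */
            break;
        case 2:             /* italic candidate */
            two++;
            break;
        case 3:
        case 4:             /* bold candidate (four = apostrophe + bold) */
            three++;
            break;
        default:            /* five or more: italic and bold together */
            two++;
            three++;
            break;
        }
    }
    return two % 2 == 1 && three % 2 == 1;
}
```

Run lengths {2, 3}, as in `''a'''b`, give true, so a victim candidate would be kept; {2, 2, 3} gives false, because the italics already pair up.
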
