[Wikitext-l] On tokenizing wikitext

Andreas Jonsson andreas.jonsson at kreablo.se
Sun Aug 22 15:09:04 UTC 2010


The parser I am writing follows a four layered design:

- client application
- parser context
- parser
- lexical scanner

In this post I will write about the lexical scanner and what the
special requirements are for wikitext.  Since all input is valid
wikitext, the lexer must never fail to accept a character in any
position or state.  Also, it is important that the lexer protects the
parser from spurious tokens.  The parser grammar will quickly become
very complex if it has to deal with unexpected tokens.  It is
therefore beneficial to make the lexer a bit more complex and have it
decide whether, for instance, a </li> appearing in a particular
context really is an HTML tag token, or whether it should rather be
passed through as verbatim characters.  To accomplish this, two
things are required: lookahead and the ability to selectively enable
or disable individual token productions.

== Basic tokens

At the very bottom of the lexer there are 7 kinds of tokens:

SPACE_TAB:      SPACE_TAB_CHAR+;
NEWLINE:        '\r\n' | NEWLINE_CHAR;
SPECIAL:        SPECIAL_SYMBOL;
APOS:           '\'';
ASCII_WORD:     ASCII_LETTER+;
WORD:           ~(SPACE_TAB_CHAR|NEWLINE_CHAR|SPECIAL_SYMBOL|'\''|ASCII_LETTER|NON_PRINTABLE_CHAR)+;
NON_PRINTABLE:  NON_PRINTABLE_CHAR+ {$channel = HIDDEN;};

These token types cover all characters in the Unicode set; any
character will match exactly one of the above.  The SPECIAL rule
matches exactly one character out of a set of characters that may be
used to form more advanced tokens; it captures instances of such
characters that are not part of a token.  Other lexer rules may also
fall back to producing SPECIAL tokens, which may contain multiple
characters.  Apostrophes are handled by the parser, so each one is
passed through as a token of its own.  ASCII_WORD needs to be treated
separately in order to support internal link prefixes and trails.  A
word is very loosely defined as a sequence of characters that do not
fall into any other category of characters.  Non-printable characters
are hidden from the parser.

If we removed all lexer rules except the basic ones above, the result
of parsing would be text split into paragraphs, with all markup
rendered verbatim and non-printable characters stripped away.  Every
other kind of formatting is triggered by the lexer producing a more
complex token.
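
To illustrate this baseline behaviour, here is a minimal C sketch of
how any single code point falls into exactly one of the basic
categories.  The helper functions and the exact SPECIAL_SYMBOL set
are made up for the example; the real lexer derives them from the
character-class fragments:

#include <stdbool.h>

/* Token kinds mirroring the basic lexer rules above. */
enum basic_token {
    TOK_SPACE_TAB, TOK_NEWLINE, TOK_SPECIAL, TOK_APOS,
    TOK_ASCII_WORD, TOK_WORD, TOK_NON_PRINTABLE
};

/* Assumed character classes; the SPECIAL set below is only a guess at
 * some characters that can start more advanced tokens. */
static bool is_space_tab(int c)     { return c == ' ' || c == '\t'; }
static bool is_newline(int c)       { return c == '\n' || c == '\r'; }
static bool is_special(int c)       { return c == '[' || c == ']' || c == '{' ||
                                             c == '}' || c == '|' || c == '=' ||
                                             c == '<' || c == '>' || c == '*' ||
                                             c == '#' || c == ':' || c == ';'; }
static bool is_ascii_letter(int c)  { return (c >= 'a' && c <= 'z') ||
                                             (c >= 'A' && c <= 'Z'); }
static bool is_non_printable(int c) { return c >= 0 && c < 0x20 &&
                                             !is_newline(c) && c != '\t'; }

/* Every code point matches exactly one basic category. */
static enum basic_token classify(int c)
{
    if (is_space_tab(c))     return TOK_SPACE_TAB;
    if (is_newline(c))       return TOK_NEWLINE;
    if (c == '\'')           return TOK_APOS;
    if (is_special(c))       return TOK_SPECIAL;
    if (is_ascii_letter(c))  return TOK_ASCII_WORD;
    if (is_non_printable(c)) return TOK_NON_PRINTABLE;
    return TOK_WORD;         /* everything else */
}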

== Lookahead

Some tokens aren't tokens unless followed by a proper closing token.
Thus, we must sometimes perform a lookahead in the input stream
before accepting a token.  Antlr's built-in facilities turned out to
be insufficient for this:

1. Syntactic predicates cannot be used for lookahead.  E.g., the rule
    ('ab')=> 'a' will match 'a' regardless of what follows.  The
    trailing 'b' in the lookahead is silently ignored by Antlr.

2. Using Antlr's built-in backtracking has two major problems:

    1. The DFA tables become extremely large.

    2. Passing arguments to lexer rules does not work.  (This may be
       specific to the C target.)

Instead I am using the following pattern to perform lookahead:

RULE
@init{
     bool success;
}:
    MATCHING_CHARACTERS RULE_LOOKAHEAD[&success]
    {
        if (!success) {
            // Fall back to producing a SPECIAL token, or whatever
            // else might be appropriate.
        }
    }
    ;

RULE_LOOKAHEAD[bool *success]
@init {
     ANTLR3_MARKER mark;
}:
    {
        mark = MARK();
    }
    (
       (ALTERNATIVE1 {*success = true;})
     | (ALTERNATIVE2 {*success = false;})
     | ...
    )
    {
        REWIND(mark);
    }
    ;

This is a bit clumsy syntactically, but it does the job.
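
For readers who have not used the C target: MARK() saves the current
input position and REWIND() restores it.  The same idea, hand-rolled
over a plain character buffer with invented names, looks roughly like
this:

#include <stdbool.h>
#include <stddef.h>

struct input {
    const char *buf;
    size_t      len;
    size_t      pos;    /* current read position */
};

/* Save the current position (the MARK() part). */
static size_t save_mark(const struct input *in) { return in->pos; }

/* Restore a saved position (the REWIND(mark) part). */
static void rewind_to(struct input *in, size_t mark) { in->pos = mark; }

/* Speculatively check whether the upcoming characters match `expected`;
 * the input position is restored regardless of the outcome, just as in
 * the RULE_LOOKAHEAD pattern above. */
static bool lookahead_matches(struct input *in, const char *expected)
{
    size_t mark = save_mark(in);
    bool success = true;

    for (const char *p = expected; *p != '\0'; ++p, ++in->pos) {
        if (in->pos >= in->len || in->buf[in->pos] != *p) {
            success = false;
            break;
        }
    }
    rewind_to(in, mark);
    return success;
}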

When falling back to producing different tokens, it may sometimes be
convenient to produce several tokens at once.  This is, however, not
supported by the C target.  I have provided a patch for this, but Jim
Idle said that it could not be included in the main distribution of
libantlr3c (presumably because it would break other things), so we
may not want to rely on it.
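
A common workaround in hand-written lexers, sketched here only as an
illustration (the names are invented, and this is not necessarily
what the patch does), is to keep a small queue of pending tokens and
drain it before scanning any further input:

#include <stddef.h>

#define PENDING_MAX 8

struct token { int type; size_t start; size_t stop; };

/* A tiny FIFO of already-built tokens; head and tail are monotonically
 * increasing counters, so (tail - head) is the number of queued tokens. */
struct pending_tokens {
    struct token queue[PENDING_MAX];
    size_t head, tail;
};

static int pending_push(struct pending_tokens *p, struct token t)
{
    if (p->tail - p->head >= PENDING_MAX)
        return -1;                           /* queue full */
    p->queue[p->tail % PENDING_MAX] = t;
    p->tail++;
    return 0;
}

static int pending_pop(struct pending_tokens *p, struct token *out)
{
    if (p->head == p->tail)
        return -1;                           /* queue empty */
    *out = p->queue[p->head % PENDING_MAX];
    p->head++;
    return 0;
}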

== Selectively enabling and disabling token productions

By maintaining a lexer context and using the semantic predicates in
Antlr, it is possible to completely avoid generating spurious tokens.
The basic pattern is as follows:

RULE: {!CX->ruleDisabled}?=> RULE_PATTERN { onRule(CX); }
     ;

So, there is a predicate 'ruleDisabled' that guards the rule, and a
method 'onRule' that reports to the context that the rule was taken.
To take a specific example:

BEGIN_TABLE:
     {BOL && !CX->wikitextTableOpenDisabled}?=> (SKIP_SPACE '{|')=>
     SKIP_SPACE
     BEGIN_TABLE_INTERNAL
     ;

fragment
BEGIN_TABLE_INTERNAL
@init{
     pANTLR3_VECTOR attrs = NULL;
}:
     '{|' {onWikitextTableOpen(CX);}
     ATTRIBUTE_LIST_TABLE[&attrs]
     { CUSTOM = attrs; }
     ;

The logic that governs the 'ruleDisabled' predicates is fairly complex,
and writing the code for it directly is a very error-prone process.
Instead, I have opted to write a table with a description of each
predicate and let a script generate the code.  For instance, the
description for wikitext table open and close predicates looks like
this:

     array(
           'name'              =>  'wikitextTableOpen',
           'close'             =>  'wikitextTableClose',
           'initiallyDisabled' =>  array(),
           'affects'           =>  array(new TypeEnable('wikitextTable', 'WIKITEXTTABLE')),
           'mayNest'           =>  true,
           'haveNestingCount'  =>  true,
           'types'             =>  array('block', 'wikitextTable'),
           'pushesBlockContext'=>  true,
           ),

(It may not be immediately clear what each attribute means, but I
have documented them in the source file.)  This is one of the most
complex predicates to describe.  The script generates the methods
onWikitextTableOpen/Close and enable/disableWikitextTable, which will
look something like this:

static inline void onWikitextTableOpen(struct MWLEXERCONTEXT_struct *context) {
     context->wikitextTableOpenNestingLevel++;
     enableWikitextTable(context, DC_WIKITEXTTABLE);
     context->wikitextTableCloseDisabled &= ~ DC_OPENED;
     pushBlockContext(context, &context->wikitextTableOpenDisabled);
}

static inline void onWikitextTableClose(struct MWLEXERCONTEXT_struct *context) {
     context->wikitextTableOpenNestingLevel--;
     if (context->wikitextTableOpenNestingLevel == 0) {
         context->wikitextTableCloseDisabled |= DC_OPENED;
         disableWikitextTable(context, DC_WIKITEXTTABLE);
         context->wikitextTableOpenDisabled &= ~ DC_WIKITEXTTABLE;
     }
     popBlockContext(context, &context->wikitextTableOpenDisabled);
}
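
The enable/disableWikitextTable helpers are not shown here.  As a
rough, purely hypothetical sketch (the struct and field names below
are invented for the example and are not the generated code), they
would simply set or clear a reason bit in the disabled-mask of every
token production that belongs to the 'wikitextTable' type:

/* Hypothetical illustration; not the generated code. */
struct table_token_masks {
    int tableCloseDisabled;    /* guards the '|}' token          */
    int tableRowDisabled;      /* guards the '|-' token          */
    int tableCellDisabled;     /* guards the '|' and '||' tokens */
};

static inline void enable_wikitext_table(struct table_token_masks *m, int reason)
{
    m->tableCloseDisabled &= ~reason;   /* clear the reason bit: enabled when 0 */
    m->tableRowDisabled   &= ~reason;
    m->tableCellDisabled  &= ~reason;
}

static inline void disable_wikitext_table(struct table_token_masks *m, int reason)
{
    m->tableCloseDisabled |= reason;    /* set the reason bit: any bit disables */
    m->tableRowDisabled   |= reason;
    m->tableCellDisabled  |= reason;
}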

=== Block Context

The block context is a generalization of the table context concept
that I have mentioned in a previous email.

The option 'pushesBlockContext' that is visible in the above example
states that the opening token corresponding to a particular predicate
will create a new "block context" and push this to a stack.  The
corresponding end token will pop the stack and restore the previous
block context.

When a particular block context is activated, the corresponding
closing token is enabled, as well as any tokens internal to that
block context.  E.g., the '<ul>' token will enable the '</ul>' and
'<li>' tokens.  Deactivating a block context disables the tokens
again.

So, pushing a new block context includes deactivating the current
context and activating the new one; popping includes deactivating the
current block context and activating the previous one, if any.
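
As a self-contained sketch of that push/pop behaviour (again with
invented names and a simplified context layout, not the actual
generated code):

/* Each block context knows which disabled-masks guard its closing token
 * and its internal tokens. */
#define MAX_BLOCK_DEPTH   64
#define DC_BLOCKCONTEXT 0x01   /* hypothetical "outside its block" reason bit */

struct block_ctx {
    int *closeDisabled;        /* mask guarding the closing token            */
    int *internalDisabled;     /* mask guarding tokens internal to the block */
};

struct block_ctx_stack {
    struct block_ctx entries[MAX_BLOCK_DEPTH];
    int depth;
};

/* Activating a block context enables its closing and internal tokens;
 * deactivating it disables them again. */
static void activate(const struct block_ctx *b)
{
    *b->closeDisabled    &= ~DC_BLOCKCONTEXT;
    *b->internalDisabled &= ~DC_BLOCKCONTEXT;
}

static void deactivate(const struct block_ctx *b)
{
    *b->closeDisabled    |= DC_BLOCKCONTEXT;
    *b->internalDisabled |= DC_BLOCKCONTEXT;
}

/* Pushing deactivates the current context and activates the new one. */
static void push_block_ctx(struct block_ctx_stack *s, struct block_ctx b)
{
    if (s->depth > 0)
        deactivate(&s->entries[s->depth - 1]);
    if (s->depth < MAX_BLOCK_DEPTH) {
        s->entries[s->depth++] = b;
        activate(&b);
    }
}

/* Popping deactivates the current context and re-activates the previous one. */
static void pop_block_ctx(struct block_ctx_stack *s)
{
    if (s->depth == 0)
        return;
    deactivate(&s->entries[--s->depth]);
    if (s->depth > 0)
        activate(&s->entries[s->depth - 1]);
}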

As a consequence, the HTML tokens '<li>', '<td>', '<tr>', etc. and the
wikitext table tokens '|-', '|', '||', etc. will be disabled, and the
corresponding text strings rendered verbatim, when they occur outside
the appropriate context.  Also, block elements that push the block
context must be closed with the correct end token.  For example:

<ul>
<li><div>text</li>
</ul>

will be rendered as:

<ul><li><div>text&lt;/li&gt; &lt;/ul&gt; </div></li></ul>

Since the <div> is still open, the end tokens for the list are not
active; the open elements are instead closed at the end of the article.

Restricting the block contexts in this way avoids a lot of problems
in the parser.  Since the behavior does not seem to be well defined
in the current parser, I would guess that this restriction is
acceptable.

To summarize this rather long post, the lexer is the most complex
part, but by letting it do a thorough filtering of the input, the
parser's job becomes much simpler.

Best regards,

Andreas Jonsson



