A specialized lexer generator for wikitext - Wikitext-l

26 Jan 2011


Hi,
...
From my experiments with an Antlr-based parser, I am personally
convinced that it is possible to write a formal parser for MediaWiki
syntax that is sufficiently compatible with the original to be used as
a replacement.  It is, however, questionable whether Antlr is the
best option for implementing such a parser.  It has several
drawbacks: It is hard to follow the code in the lexical analyser.  The
algorithm used by Antlr doesn't match the requirements for wikitext
very well.  I would guess that my implementation contains many errors
where the lexer will fail to accept the input.  Since Antlr is not
designed for languages where everything is valid input, it is very
hard to find these errors by inspecting the code.  Also, an
external tool for generating semantic predicates is necessary, so the
description of the lexical rules is divided into two separate parts.
Furthermore, a recursive descent algorithm is not the fastest
option when it comes to parsing.  If the token stream is sufficiently
refined, an LALR parser can be used instead.
I believe that it is always possible to transform wikitext into a
token stream that can be easily parsed.  Thus, I would propose writing
a specialized lexer generator that is suitable for generating a lexer
that does precisely this.
As a consequence, the wikitext specification would be more compact; it
would be easier to maintain for multiple target languages and the
resulting parser would be smaller and faster than an Antlr-generated
one.  The grammar over the resulting token stream will be context-free
and parsable with any standard parsing algorithm.
Here are my preliminary thoughts on the requirements of the lexer
generator:
1. The input to the lexer generator is a table of tokens where the
   properties of each token are described.
2. The output from the generated lexer is a sequence of tokens.  Each
   token may be associated with a string and a vector of attributes.
3. Suitable pattern matching algorithm.
There should be four basic classes of characters:
1. New line characters
2. Special characters (that need escaping)
3. Ordinary characters
4. Ignored characters
If no token production rule matches, the characters should just be
   lumped together into tokens, each being the longest possible run of
   characters of the same class.
A bonus objective is to make these character classes runtime
   configurable, since the set of characters that need escaping differs for
   different applications.
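As a rough illustration of the fallback rule and the configurable character classes described above, here is a minimal Java sketch. The class name `CharClassRuns`, the `Run` type, and the example character sets are assumptions made for illustration; the sketch also assumes that ignored characters are simply dropped from the output.

```java
import java.util.ArrayList;
import java.util.List;

/** Fallback tokenization: lump consecutive characters of the same class. */
final class CharClassRuns {
    enum CharClass { NEWLINE, SPECIAL, ORDINARY, IGNORED }

    /** One fallback token: a maximal run of characters of a single class. */
    static final class Run {
        final CharClass charClass;
        final String text;

        Run(CharClass charClass, String text) {
            this.charClass = charClass;
            this.text = text;
        }
    }

    // The special and ignored sets are passed in at runtime, so different
    // applications can use different sets (hypothetical configuration style).
    private final String special;
    private final String ignored;

    CharClassRuns(String special, String ignored) {
        this.special = special;
        this.ignored = ignored;
    }

    CharClass classify(char c) {
        if (c == '\n' || c == '\r') return CharClass.NEWLINE;
        if (ignored.indexOf(c) >= 0) return CharClass.IGNORED;
        if (special.indexOf(c) >= 0) return CharClass.SPECIAL;
        return CharClass.ORDINARY;
    }

    /** Splits the input into maximal same-class runs; ignored runs are dropped. */
    List<Run> lump(String input) {
        List<Run> runs = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            CharClass cls = classify(input.charAt(start));
            int end = start + 1;
            while (end < input.length() && classify(input.charAt(end)) == cls) {
                end++;
            }
            if (cls != CharClass.IGNORED) {
                runs.add(new Run(cls, input.substring(start, end)));
            }
            start = end;
        }
        return runs;
    }

    public static void main(String[] args) {
        // Example sets only; a real application would supply its own.
        CharClassRuns lexer = new CharClassRuns("[]{}|'=<>", "\u200B");
        for (Run run : lexer.lump("ab[[cd\n")) {
            System.out.println(run.charClass + " (" + run.text.length() + " chars)");
        }
    }
}
```

In a generated lexer this loop would only run at positions where no token production matched.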
A token production rule consists of a simple regular pattern that
   is used for preliminary matching of a candidate production rule.
   The matched text can in a second step be matched by a PCRE or
   otherwise processed to finally decide whether to accept the
   production or not.
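The two-step matching described above could look roughly like the following Java sketch. It is only an illustration: the cheap first step is reduced to a literal prefix test, `java.util.regex` stands in for PCRE, and the external link pattern and the name `TwoStageMatch` are assumptions rather than part of the proposal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Two-step matching: a cheap preliminary test, then a full regular expression. */
final class TwoStageMatch {
    // Step 1 (assumption): a literal prefix selects the candidate production.
    private static final String PREFIX = "[http";

    // Step 2 (assumption): a simplified pattern for a bracketed external link
    // with an optional label, used to accept or reject the candidate.
    private static final Pattern EXTERNAL_LINK =
            Pattern.compile("\\[(https?://[^\\s\\]]+)(?: ([^\\]\\n]*))?\\]");

    /** Returns the length of the accepted token at pos, or -1 if rejected. */
    static int match(String input, int pos) {
        if (!input.startsWith(PREFIX, pos)) {
            return -1; // preliminary match failed, the regex is never run
        }
        Matcher m = EXTERNAL_LINK.matcher(input);
        m.region(pos, input.length());
        return m.lookingAt() ? m.end() - pos : -1;
    }

    public static void main(String[] args) {
        System.out.println(match("[http://example.org label] tail", 0));
        System.out.println(match("[httpfoo]", 0));
    }
}
```

The point of the split is that the expensive pattern is only tried at positions where the cheap test has already succeeded.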
4. Toggling token productions.
The context is effectively defined by the set of enabled token
   productions, which means there are a very large number of different
   contexts.  Therefore, we need a more dynamic context mechanism than
   what is supported by tools like flex and JavaCC.  We need a
   mechanism where individual token productions can be dynamically
   disabled and enabled.  The code for doing this will be inferred from
   the properties of the tokens and generated.
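To make the idea of a context defined by the set of enabled productions more concrete, here is a small hand-written Java sketch. In the proposal this code would be generated from the token properties; the production names and the enable/disable rules below are purely hypothetical.

```java
import java.util.EnumSet;

/** The lexer context is just the set of currently enabled token productions. */
final class ProductionToggles {
    // Hypothetical productions, for illustration only.
    enum Production { LINK_OPEN, LINK_CLOSE, HEADING_OPEN, HEADING_CLOSE }

    private final EnumSet<Production> enabled =
            EnumSet.of(Production.LINK_OPEN, Production.HEADING_OPEN);

    boolean isEnabled(Production p) {
        return enabled.contains(p);
    }

    /** Called when a production is accepted; toggles other productions. */
    void accepted(Production p) {
        switch (p) {
            case LINK_OPEN:   // assumed rule: links do not nest
                enabled.remove(Production.LINK_OPEN);
                enabled.add(Production.LINK_CLOSE);
                break;
            case LINK_CLOSE:
                enabled.remove(Production.LINK_CLOSE);
                enabled.add(Production.LINK_OPEN);
                break;
            case HEADING_OPEN:
                enabled.remove(Production.HEADING_OPEN);
                enabled.add(Production.HEADING_CLOSE);
                break;
            case HEADING_CLOSE:
                enabled.remove(Production.HEADING_CLOSE);
                enabled.add(Production.HEADING_OPEN);
                break;
        }
    }

    /** A copy of the current context, e.g. for saving and restoring state. */
    EnumSet<Production> snapshot() {
        return EnumSet.copyOf(enabled);
    }

    public static void main(String[] args) {
        ProductionToggles context = new ProductionToggles();
        context.accepted(Production.LINK_OPEN);
        System.out.println(context.snapshot());
    }
}
```

With n productions there are up to 2^n such contexts, which is why a fixed list of start conditions is a poor fit here.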
5. Late token determination.
The semantic meaning of a sequence of apostrophes cannot be
   resolved until the end of a text line.  The most efficient method
   to handle this special case is to produce a set of pseudo tokens
   that can be resolved into specific tokens when reaching the end of
   line.  No lookahead would be required for this.
In my previous implementation I solved apostrophe parsing by parser
   lookahead, which is costly.  This mechanism alone seemed to account
   for about 30% of the execution time of the parser.
Furthermore, a token representing the start of an internal link can
   be further resolved into a "red link" or "blue link" at the end of
   the lexical analysis.
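A simplified Java sketch of late token determination for apostrophes follows. It assumes the lexer records only the length of each apostrophe run (2 or more) as a pseudo token and resolves all runs when the end of the line is reached. The rule used here, turning the first bold run into a literal apostrophe plus italic when both the italic and bold counts are odd, is a deliberate simplification; as far as I understand it, MediaWiki picks the run to reinterpret based on the surrounding characters. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Resolves apostrophe pseudo tokens once the whole line has been seen. */
final class ApostropheResolver {
    enum Kind { LITERAL_APOSTROPHE, ITALIC, BOLD, BOLD_ITALIC }

    /**
     * @param runLengths lengths of the apostrophe runs on one line, in order
     *                   (each assumed to be at least 2)
     */
    static List<Kind> resolveLine(int[] runLengths) {
        int italics = 0;
        int bolds = 0;
        for (int n : runLengths) {
            if (n == 2) {
                italics++;
            } else if (n == 3 || n == 4) {
                bolds++;
            } else if (n >= 5) {
                italics++;
                bolds++;
            }
        }
        // Simplified rule: with an odd number of both, one bold run is
        // reinterpreted as a literal apostrophe followed by italic.
        boolean demote = italics % 2 == 1 && bolds % 2 == 1;

        List<Kind> resolved = new ArrayList<>();
        for (int n : runLengths) {
            if (n < 2) {
                continue; // not a pseudo token under the stated assumption
            } else if (n == 2) {
                resolved.add(Kind.ITALIC);
            } else if (n <= 4) {
                if (n == 4) {
                    resolved.add(Kind.LITERAL_APOSTROPHE); // surplus apostrophe
                }
                if (demote) {
                    resolved.add(Kind.LITERAL_APOSTROPHE);
                    resolved.add(Kind.ITALIC);
                    demote = false;
                } else {
                    resolved.add(Kind.BOLD);
                }
            } else {
                for (int i = 5; i < n; i++) {
                    resolved.add(Kind.LITERAL_APOSTROPHE); // surplus apostrophes
                }
                resolved.add(Kind.BOLD_ITALIC);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Runs for a line such as: '''foo''
        System.out.println(resolveLine(new int[] {3, 2}));
    }
}
```

The lexer itself never looks ahead: it emits the pseudo tokens as it goes and the resolution is a single pass over the runs of the finished line.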
6. Speculative execution.
Lookahead is problematic for context sensitive analysis.  Instead
   we use speculative execution.  Thus, there must be a mechanism to
   save and restore the full state of the analysis, as well as the
   state of input and output streams.
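The save and restore mechanism could be sketched in Java as follows. This is a toy illustration, not the proposed design: the analysis state is reduced to a single flag, tokens are plain strings, and the names `SpeculativeLexer`, `Checkpoint` and `tryInternalLink` are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Speculative execution: emit tokens optimistically, roll back on failure. */
final class SpeculativeLexer {
    /** Snapshot of input position, output length and analysis state. */
    static final class Checkpoint {
        final int inputPos;
        final int outputSize;
        final boolean insideLink;

        Checkpoint(int inputPos, int outputSize, boolean insideLink) {
            this.inputPos = inputPos;
            this.outputSize = outputSize;
            this.insideLink = insideLink;
        }
    }

    private final String input;
    private final List<String> output = new ArrayList<>();
    private int pos = 0;
    private boolean insideLink = false; // stands in for the full analysis state

    SpeculativeLexer(String input) {
        this.input = input;
    }

    Checkpoint save() {
        return new Checkpoint(pos, output.size(), insideLink);
    }

    void restore(Checkpoint cp) {
        pos = cp.inputPos;
        insideLink = cp.insideLink;
        // Discard every token produced since the checkpoint.
        output.subList(cp.outputSize, output.size()).clear();
    }

    /** Tries to lex "[[...]]" on one line; rolls back if the link never closes. */
    boolean tryInternalLink() {
        if (!input.startsWith("[[", pos)) {
            return false;
        }
        Checkpoint cp = save();
        output.add("LINK_OPEN"); // emitted speculatively
        pos += 2;
        insideLink = true;
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != '\n'
                && !input.startsWith("]]", pos)) {
            pos++;
        }
        if (!input.startsWith("]]", pos)) {
            restore(cp); // speculation failed
            return false;
        }
        output.add("TEXT:" + input.substring(start, pos));
        output.add("LINK_CLOSE");
        pos += 2;
        insideLink = false;
        return true;
    }

    int position() {
        return pos;
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        SpeculativeLexer lexer = new SpeculativeLexer("[[Foo]] bar");
        System.out.println(lexer.tryInternalLink() + " " + lexer.output());
    }
}
```

After a failed speculation the lexer is in exactly the state it had before the attempt, so the opening brackets can then be emitted as ordinary text.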
7. Simple action language.
Only a few operations need to be supported in the actions.  To
   simplify code generation for multiple languages, it is a good idea
   to define a simple language for the actions that is independent of
   target language to avoid having to maintain the actions per target
   language.
8. Multilingual code generation.
It is desirable to support at least PHP, JavaScript and C.
The lexer generator will generate code that assumes that a runtime
environment implementing the following APIs exists in each target
language:
// Input
public interface InputStream {
    InputStreamMark mark();
    void rewind(InputStreamMark mark);
    void advance(int characters);
    String getText(InputStreamMark mark);
    int getLine();
    int getColumn();
}
// Token management
enum TokenType { ... }
public interface TokenFactory {
    <T extends Token> T newToken(TokenType tokenType);
}
public interface Token {
    void setText(String text);
    String getText();
    void setPosition(int line, int column);
}
public interface AttributeToken extends Token {
    void setAttributes(AttributeList attributeList);
}
public interface HeadingToken extends AttributeToken {
    void setLevel(int level);
}
public interface LinkToken extends Token {
    void setLinkTarget(String linkTarget);
}
public interface InternalLinkToken extends LinkToken {
    void setLinkType(LinkType linkType);
    void setTargetURL(String url);
}
public interface TagToken extends AttributeToken {
    void setTagName(String name);
}
// Output token stream
public interface TokenStream {
    void put(Token token);
    TokenStreamMark mark();
    Token takeBack();
    void discard(TokenStreamMark mark);
}
// Application interaction
public interface Application {
    boolean isValidLink(String text);
    boolean isValidURL(String text);
    boolean linkPrefixesEnabled();
    Set<String> getURLProtocols();
    Set<String> getBlockTags();
    Set<String> getInlineTags();
    void resolveLinks(Set<InternalLinkToken> linkTokens);
}
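To give an idea of how small the runtime side could be, here is a sketch of a string-backed implementation of the InputStream interface above. The interface is repeated inside a wrapper class so that the example compiles on its own. Several details are assumptions: getText(mark) is taken to return the text between the mark and the current position, lines are counted from 1 and columns from 0, and advance() stops silently at the end of the input.

```java
/** Self-contained sketch of a string-backed input stream with mark/rewind. */
final class StringInputDemo {
    interface InputStreamMark {}

    // Copy of the InputStream interface from the listing above.
    interface InputStream {
        InputStreamMark mark();
        void rewind(InputStreamMark mark);
        void advance(int characters);
        String getText(InputStreamMark mark);
        int getLine();
        int getColumn();
    }

    static final class StringInputStream implements InputStream {
        /** A mark records everything needed to return to a position. */
        private static final class Mark implements InputStreamMark {
            final int pos;
            final int line;
            final int column;

            Mark(int pos, int line, int column) {
                this.pos = pos;
                this.line = line;
                this.column = column;
            }
        }

        private final String text;
        private int pos = 0;
        private int line = 1;   // assumption: 1-based lines
        private int column = 0; // assumption: 0-based columns

        StringInputStream(String text) {
            this.text = text;
        }

        public InputStreamMark mark() {
            return new Mark(pos, line, column);
        }

        public void rewind(InputStreamMark mark) {
            Mark m = (Mark) mark;
            pos = m.pos;
            line = m.line;
            column = m.column;
        }

        public void advance(int characters) {
            for (int i = 0; i < characters && pos < text.length(); i++) {
                if (text.charAt(pos) == '\n') {
                    line++;
                    column = 0;
                } else {
                    column++;
                }
                pos++;
            }
        }

        /** Assumed meaning: the text consumed since the mark was taken. */
        public String getText(InputStreamMark mark) {
            return text.substring(((Mark) mark).pos, pos);
        }

        public int getLine() {
            return line;
        }

        public int getColumn() {
            return column;
        }
    }

    public static void main(String[] args) {
        StringInputStream in = new StringInputStream("ab\ncd");
        InputStreamMark start = in.mark();
        in.advance(4);
        System.out.println(in.getLine() + ":" + in.getColumn());
        in.rewind(start);
        System.out.println(in.getLine() + ":" + in.getColumn());
    }
}
```

Because a mark is an immutable value, any number of marks can be outstanding at once, which is what speculative execution needs.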
Best regards,
Andreas Jonsson