Hi,
(Sorry for the multiple posts, but I seem to have a problem with my mail client.)
From my experiments with an Antlr-based parser, I am personally
convinced that it is possible to write a formal parser for Mediawiki
syntax that is sufficiently compatible with the original to be used
as a replacement. It is, however, questionable whether Antlr is the
best option for implementing such a parser. It has several
drawbacks: it is hard to follow the code in the lexical analyser,
and the algorithm used by Antlr doesn't match the requirements of
wikitext very well. I would guess that my implementation contains
many errors where the lexer will fail to accept the input. Since
Antlr is not designed for languages where everything is valid input,
it is very hard to find these errors by inspecting the code. Also,
an external tool for generating semantic predicates is necessary, so
the description of the lexical rules is divided into two separate
parts. Furthermore, a recursive descent algorithm is not the fastest
option when it comes to parsing. If the token stream is sufficiently
refined, an LALR parser can be used instead.
I believe that it is always possible to transform wikitext into a
token stream that can be easily parsed. Thus, I would propose
writing a specialized lexical analyser generator that is suitable
for generating a lexer that does precisely this.
As a consequence, the wikitext specification would be more compact;
it would be easier to maintain for multiple target languages, and
the resulting parser would be smaller and faster than an
Antlr-generated one. The grammar over the resulting token stream
will be context-free and parsable with any standard parsing
algorithm.
Here are my preliminary thoughts on the requirements for the lexer
generator:
1. The input to the lexer generator is a table of tokens in which
the properties of each token are described.
2. The output from the generated lexer is a sequence of tokens. Each
token may be associated with a string and a vector of attributes.
3. A suitable pattern matching algorithm.
There should be four basic classes of characters:
1. New line characters
2. Special characters (that need escaping)
3. Ordinary characters
4. Ignored characters
If no token production rule matches, the characters should simply
be lumped together into tokens, each consisting of the longest run
of consecutive characters of the same class.
A bonus objective is to make these character classes runtime
configurable, since the set of special characters differs for
different applications.
A token production rule consists of a simple regular pattern that
is used for preliminary matching of a candidate production rule.
In a second step, the matched text can be matched against a PCRE
or otherwise processed to decide whether to finally accept the
production or not.
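As a rough sketch of the fallback lumping described above: when no
production rule matches, consecutive characters of the same class
become one token. The class names follow the four classes listed
earlier, but the concrete character sets here are my own assumptions
(in the real design they would be runtime configurable):

```java
import java.util.ArrayList;
import java.util.List;

public class ClassLumper {
    enum CharClass { NEWLINE, SPECIAL, ORDINARY, IGNORED }

    // Hypothetical classification; the set of special characters is
    // an example only and would differ per application.
    static CharClass classify(char c) {
        if (c == '\n' || c == '\r') return CharClass.NEWLINE;
        if ("[]{}|'=*#:;".indexOf(c) >= 0) return CharClass.SPECIAL;
        if (c == '\t') return CharClass.IGNORED;
        return CharClass.ORDINARY;
    }

    // Lump consecutive characters of the same class into one token;
    // ignored characters are dropped.
    static List<String> lump(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            CharClass cls = classify(input.charAt(i));
            int start = i;
            while (i < input.length() && classify(input.charAt(i)) == cls) i++;
            if (cls != CharClass.IGNORED) tokens.add(input.substring(start, i));
        }
        return tokens;
    }
}
```

For example, lump("''a''") yields the three tokens "''", "a", "''",
which is exactly the shape the apostrophe handling in point 5 needs.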
4. Toggling token productions depending on context.
The context is effectively defined by the set of enabled token
productions, which means there is a very large number of different
contexts. Therefore we need a more dynamic context mechanism than
what is supported by tools like flex and javacc: a mechanism where
individual token productions can be dynamically enabled and
disabled.
5. Late token determination.
The semantic meaning of a sequence of apostrophes cannot be
resolved until the end of a text line. The most efficient method
to handle this special case is to produce a set of pseudo tokens
that can be resolved into specific tokens when the end of the line
is reached. No lookahead would be required for this.
In my previous implementation I solved apostrophe parsing by parser
lookahead, which is costly. This mechanism alone seemed to account
for about 30% of the execution time of the parser.
Furthermore, a token representing the start of an internal link can
be further resolved into a "red link" or "blue link" at the end of
the lexical analysis.
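As a sketch of the resolution step: each run of apostrophes is
emitted during the line as one pseudo token carrying the run length,
and at end of line each run is mapped to a concrete token. The
mapping below is a deliberate simplification of MediaWiki's actual
apostrophe rules, and the token names are my own:

```java
public class ApostropheResolver {
    // Resolve one pseudo token (a run of n apostrophes) into a
    // concrete token name at end of line. Simplified rules:
    // 2 -> italic, 3 -> bold, 5 -> bold+italic; other lengths are
    // treated as literal text plus a toggle (as MediaWiki roughly
    // does with surplus apostrophes).
    static String resolve(int n) {
        switch (n) {
            case 2: return "ITALIC_TOGGLE";
            case 3: return "BOLD_TOGGLE";
            case 5: return "BOLD_ITALIC_TOGGLE";
            default: return "TEXT_PLUS_TOGGLE";
        }
    }
}
```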
6. Speculative execution.
Lookahead is problematic for context-sensitive analysis. Instead
we use speculative execution. Thus, there must be a mechanism to
save and restore the full state of the analysis, as well as the
state of input and output streams.
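To make the save/restore idea concrete, here is a toy sketch of
speculative matching of an internal link. The state snapshot here is
just the input position and the emitted-token count; a real lexer
state would also include the enabled productions and pending pseudo
tokens. All names are my own invention:

```java
import java.util.ArrayList;
import java.util.List;

public class Speculation {
    private final String input;
    private int pos = 0;
    private final List<String> tokens = new ArrayList<>();

    Speculation(String input) { this.input = input; }

    // Snapshot of the full analysis state (simplified).
    static final class Mark {
        final int pos, tokenCount;
        Mark(int pos, int tokenCount) { this.pos = pos; this.tokenCount = tokenCount; }
    }

    Mark mark() { return new Mark(pos, tokens.size()); }

    // Restore input position and discard speculatively emitted tokens.
    void rewind(Mark m) {
        pos = m.pos;
        tokens.subList(m.tokenCount, tokens.size()).clear();
    }

    // Speculatively consume "[[target]]"; on failure, roll back.
    boolean tryInternalLink() {
        Mark m = mark();
        if (!input.startsWith("[[", pos)) return false;
        pos += 2;
        int close = input.indexOf("]]", pos);
        if (close < 0) { rewind(m); return false; }
        tokens.add("LINK(" + input.substring(pos, close) + ")");
        pos = close + 2;
        return true;
    }

    List<String> tokens() { return tokens; }
}
```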
7. Simple action language
Only a few operations need to be supported in the actions. To
simplify code generation for multiple languages, it is a good idea
to define a simple action language that is independent of the
target language, so that the actions don't have to be maintained
per target language.
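For example, the actions could be kept as data, as small instruction
lists that each backend translates mechanically into PHP, Javascript
or C. The instruction set below is invented purely for illustration:

```java
public class Actions {
    // Hypothetical target-independent action operations.
    enum Op { EMIT_TOKEN, SET_TEXT, ENABLE_PRODUCTION, DISABLE_PRODUCTION, PUSH_MARK, REWIND }

    // One action is a list of these instructions; a code generator
    // backend maps each Op to a statement in its target language.
    static final class Instruction {
        final Op op;
        final String arg;
        Instruction(Op op, String arg) { this.op = op; this.arg = arg; }
        @Override public String toString() { return op + (arg == null ? "" : " " + arg); }
    }
}
```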
8. Multilingual code generation
It is desirable to support at least PHP, Javascript and C.
The lexer generator will generate code that assumes that a runtime
environment implementing the following APIs exists in each target
language:
// Input
public interface InputStream {
    InputStreamMark mark();
    void rewind(InputStreamMark mark);
    void advance(int characters);
    String getText(InputStreamMark mark);
    int getLine();
    int getColumn();
}

// Token management
enum TokenType { ... }

public interface TokenFactory {
    <T extends Token> T newToken(TokenType tokenType);
}

public interface Token {
    void setText(String text);
    String getText();
    void setPosition(int line, int column);
}

public interface AttributeToken extends Token {
    void setAttributes(AttributeList attributeList);
}

public interface HeadingToken extends AttributeToken {
    void setLevel(int level);
}

public interface LinkToken extends Token {
    void setLinkTarget(String linkTarget);
}

public interface InternalLinkToken extends LinkToken {
    void setLinkType(LinkType linkType);
    void setTargetURL(String url);
}

public interface TagToken extends AttributeToken {
    void setTagName(String name);
}

// Output token stream
public interface TokenStream {
    void put(Token token);
    TokenStreamMark mark();
    Token takeBack();
    void discard(TokenStreamMark mark);
}

// Application interaction
public interface Application {
    boolean isValidLink(String text);
    boolean isValidURL(String text);
    boolean linkPrefixesEnabled();
    Set<String> getURLProtocols();
    Set<String> getBlockTags();
    Set<String> getInlineTags();
    void resolveLinks(Set<InternalLinkToken> linkTokens);
}
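To show the intended mark/rewind/getText semantics of the InputStream
API, here is a hypothetical in-memory implementation; only the method
names come from the proposal above, while the concrete class and the
mark representation are my own assumptions:

```java
public class StringInputStream {
    // Immutable snapshot of a stream position.
    public static final class Mark {
        final int pos, line, column;
        Mark(int pos, int line, int column) {
            this.pos = pos; this.line = line; this.column = column;
        }
    }

    private final String text;
    private int pos = 0, line = 1, column = 1;

    public StringInputStream(String text) { this.text = text; }

    public Mark mark() { return new Mark(pos, line, column); }

    public void rewind(Mark m) { pos = m.pos; line = m.line; column = m.column; }

    // Consume the given number of characters, tracking line/column.
    public void advance(int characters) {
        for (int i = 0; i < characters && pos < text.length(); i++, pos++) {
            if (text.charAt(pos) == '\n') { line++; column = 1; } else { column++; }
        }
    }

    // Text consumed since the mark was taken.
    public String getText(Mark m) { return text.substring(m.pos, pos); }

    public int getLine() { return line; }
    public int getColumn() { return column; }
}
```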