Hi,
(Sorry for the multiple posts, but I seem to have a problem with my mail client.)
From my experiments with an Antlr based parser, I am personally
convinced that it is possible to write a formal parser for Mediawiki syntax that is sufficiently compatible with the original to be used as a replacement. But it is, however, questionable whether using Antlr is the best option for implementing such a parser. It has several drawbacks: It is hard to follow the code in the lexical analyser. The algorithm used by Antlr doesn't match the requirements for wikitext very well. I would guess that my implementation contains many errors where the lexer will fail to accept the input. Since Antlr is not designed for languages where everything is valid input, it is very hard to find these errors by inspecting the code. Also an an external tool for generating semantic predicates is necessary so the description of the lexical rules are divided into two separate parts. Furthermore, for a recursive descendent algorithm is not the fastest options when it comes to parsing. If the token stream is sufficiently refined a LALR parser can be used instead.
I believe that it is always possible to transform a wikitext to a token stream that can be easily parsed. Thus, I would propose writing a specialized lexical analyser generator that is suitable for generating a lexer that does precisely this.
As a consequence, the wikitext specification would be more compact; it would be easier to maintain for multiple target languages and the resulting parser would be smaller and faster than an Antlr generated one. The grammar over the resulting token stream will be context free and parsable with any standard parsing algorithm.
Here are my preliminary thoughts on the requirement of the lexer generator:
1. The input to the lexer generator is a table of tokens where the properties of each token is described.
2. The output from the lexer generator is a sequence of tokens. Each token may be associated with a string and a vector of attributes.
3. Suitable pattern matching algorithm.
There should be four basic classes of characters:
1. New line characters
2. Special characters (that needs escaping)
3. Ordinary characters
4. Ignored characters
If no token production rule matches, the characters should just be lumped together into tokens of longest sequence of characters of the same class.
A bonus objective is to make these character classes runtime configurable, since the set of special characters differs for different applications.
A token production rule consists of a simple regular pattern that is used for preliminary matching of a candidate production rule. The matched text can in a second step be matched by a PCRE or otherwise processed to finally decide whether to accept the production or not.
4. Toggling token productions depending on context.
The context is effectively defined by the set of enabled token productions, which means there are a very large number of different contexts. Therefore we need a more dynamic context mechanism than what are supported by tools like flex and javacc. We need a mechanism where individual token productions can be dynamically disabled and enabled.
5. Late token determination.
The semantic meaning of a sequence of apostrophes cannot be resolved until the end of a text line. The most efficient method to handle this special case is to produce a set of pseudo tokens that can be resolved into specific tokens when reaching the end of line. No lookahead would be required for this.
In my previous implementation I solved apostrophe parsing by parser lookahead, which is costly. This mechanism alone seemed to account for about 30% of the execution time of the parser.
Furthermore, a token representing the start of an internal link can be further resolved into a "red link" or "blue link" at the end of the lexical analysis.
6. Speculative execution.
Lookahead is problematic for context sensitive analysis. Instead we use speculative execution. Thus, there must be a mechanism to save and restore the full state of the analysis, as well as the state of input and output streams.
7. Simple action language
Only a few operations needs to be supported in the actions. To simplify code generation for multiple languages, it is a good idea to define a simple language for the actions that is independent of target language to avoid having to maintain the actions per target language.
8. Multilingual code generation
It is desirable to support at least PHP, Javascript and C.
The lexer generator will generate code that assumes that a runtime environment that implements the following APIs exist in the target languages:
// Input
public interface InputStream { InputStreamMark mark(); void rewind(InputStreamMark mark); void advance(int characters); String getText(InputStreamMark mark); int getLine(); int getColumn(); }
// Token management
enum TokenType { ... }
public interface TokenFactory { <T extends Token> T newToken(TokenType tokenType); }
public interface Token { void setText(String text); String getText(); void setPosition(int line, int column); }
public interface AttributeToken extends Token { void setAttributes(AttributeList attributeList); }
public interface HeadingToken extends AttributeToken { void setLevel(int level); }
public interface LinkToken extends Token { void setLinkTarget(String linkTarget); }
public interface InternalLinkToken extends LinkToken { void setLinkType(LinkType linkType); void setTargetURL(String url); }
public interface TagToken extends AttributeToken { void setTagName(String name); }
// Output token stream
public interface TokenStream { void put(Token token); TokenStreamMark mark(); Token takeBack(); void discard(TokenStreamMark mark); }
// Application interaction
public interface Application { boolean isValidLink(String text); boolean isValidURL(String text); boolean linkPrefixesEnabled(); Set<String> getURLProtocols(); Set<String> getBlockTags(); Set<String> getInlineTags(); void resolveLinks(Set<InternalLinkToken> linkTokens); }