The parser I am writing follows a four layered design:
- client application
- parser context
- parser
- lexical scanner
In this post I will write about the lexical scanner and the special requirements that wikitext places on it. Since all input is valid wikitext, the lexer must never fail to accept a character, in any position or state. It is also important that the lexer protects the parser from spurious tokens. The parser grammar would quickly become very complex if it had to deal with unexpected tokens. It is therefore beneficial to make the lexer a bit more complex and make sure that, for instance, a </li> appearing in a particular context really should become an html tag token, rather than just the verbatim characters. To accomplish this, two things are required: lookahead and the ability to selectively enable and disable individual token productions.
== Basic tokens
At the very bottom of the lexer there are 7 kinds of tokens:
SPACE_TAB:     SPACE_TAB_CHAR+;
NEWLINE:       '\r\n' | NEWLINE_CHAR;
SPECIAL:       SPECIAL_SYMBOL;
APOS:          '\'';
ASCII_WORD:    ASCII_LETTER+;
WORD:          ~(SPACE_TAB_CHAR|NEWLINE_CHAR|SPECIAL_SYMBOL|'\''|ASCII_LETTER|NON_PRINTABLE_CHAR)+;
NON_PRINTABLE: NON_PRINTABLE_CHAR+ {$channel = HIDDEN;};
These token types cover all characters in the Unicode set; any character will match exactly one of the above. The SPECIAL rule matches exactly one character out of the set of characters that may be used to form more advanced tokens; it captures instances of such characters that are not part of any such token. Other lexer rules may also fall back to producing SPECIAL tokens, which may then contain multiple characters. The apostrophes are handled by the parser, so we pass each of them on as a token of its own. ASCII_WORD needs to be treated separately in order to support internal link prefixes and trails. A word is very loosely defined as a sequence of characters that do not fall into any other category of characters. Non-printable characters are hidden from the parser.
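To illustrate the fallback behaviour (the exact membership of SPECIAL_SYMBOL is defined elsewhere in the grammar, so treating '[' and ']' as special symbols here is an assumption on my part), a line such as

    foo [bar]

where the single brackets do not form any more advanced token would reach the parser roughly as

    ASCII_WORD("foo") SPACE_TAB(" ") SPECIAL("[") ASCII_WORD("bar") SPECIAL("]")

followed by a NEWLINE token for the line break; the stray brackets simply fall through to SPECIAL.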
If we removed all lexer rules except the basic ones above, the result of the parsing would be a text split into paragraphs, where all markup is rendered verbatim and non-printable characters have been stripped away. All other kinds of formatting are triggered by the lexer producing a more complex token.
== Lookahead
Some tokens aren't tokens unless they are followed by a proper closing token. Thus, we must sometimes perform a lookahead in the input stream before accepting a token. Antlr offers two obvious mechanisms for this, but neither works well here:
1. Syntactic predicates cannot be used for lookahead. E.g., the rule ('ab')=> 'a' will match 'a' regardless of what follows. The trailing 'b' in the lookahead is silently ignored by Antlr.
2. Using Antlr's built-in backtracking has two major problems:
1. The DFA tables become extremely large.
2. Passing arguments to lexer rules does not work. (This may be specific to the C target.)
Instead I am using the following pattern to perform lookahead:
RULE
@init{ bool success; }:
    MATCHING_CHARACTERS RULE_LOOKAHEAD[&success]
    {
        if (!success) {
            // Fall back to instead producing a SPECIAL token,
            // or whatever else might be appropriate.
        }
    }
    ;

RULE_LOOKAHEAD[bool *success]
@init { ANTLR3_MARKER mark; }:
    { mark = MARK(); }
    (
          (ALTERNATIVE1 {*success = true;})
        | (ALTERNATIVE2 {*success = false;})
        | ...
    )
    { REWIND(mark); }
    ;
This is a bit clumsy syntactically, but it does the job.
When falling back to producing different tokens, it would sometimes be convenient to be able to produce several tokens at once. This is, however, not supported by the C target. I have provided a patch for this, but Jim Idle said that it could not be included in the main distribution of libantlr3c since, as I understand it, it would break other things, so we may not want to rely on it.
== Selectively enabling and disabling token productions
By maintaining a lexer context and using the semantic predicates in Antlr, it is possible to completely avoid generating spurious tokens. The basic pattern is as follows:
RULE: {!CX->ruleDisabled}?=> RULE_PATTERN { onRule(CX); } ;
So, there is a predicate 'ruleDisabled' that guards the rule, and a method 'onRule' that reports to the context that the rule was taken. To take a specific example:
BEGIN_TABLE:
    {BOL && !CX->wikitextTableOpenDisabled}?=>
    (SKIP_SPACE '{|')=>
    SKIP_SPACE BEGIN_TABLE_INTERNAL
    ;

fragment BEGIN_TABLE_INTERNAL
@init{ pANTLR3_VECTOR attrs = NULL; }:
    '{|' {onWikitextTableOpen(CX);}
    ATTRIBUTE_LIST_TABLE[&attrs]
    { CUSTOM = attrs; }
    ;
The logic that governs the 'ruleDisabled' predicates is fairly complex, and writing the code for this directly is a very error-prone process. Instead, I have opted to write a table with a description of each predicate and let a script generate the code. For instance, the description of the wikitext table open and close predicates looks like this:
array(
    'name'               => 'wikitextTableOpen',
    'close'              => 'wikitextTableClose',
    'initiallyDisabled'  => array(),
    'affects'            => array(new TypeEnable('wikitextTable', 'WIKITEXTTABLE')),
    'mayNest'            => true,
    'haveNestingCount'   => true,
    'types'              => array('block', 'wikitextTable'),
    'pushesBlockContext' => true,
),
(It is maybe not immediately clear what each attribute means, but I have provided documentation for them in the source file.) This is one of the most complex predicates to describe. From this description the script generates the methods onWikitextTableOpen/Close and enable/disableWikitextTable. They will look something like this:
static inline void onWikitextTableOpen(struct MWLEXERCONTEXT_struct *context)
{
    context->wikitextTableOpenNestingLevel++;
    enableWikitextTable(context, DC_WIKITEXTTABLE);
    context->wikitextTableCloseDisabled &= ~DC_OPENED;
    pushBlockContext(context, &context->wikitextTableOpenDisabled);
}

static inline void onWikitextTableClose(struct MWLEXERCONTEXT_struct *context)
{
    context->wikitextTableOpenNestingLevel--;
    if (context->wikitextTableOpenNestingLevel == 0) {
        context->wikitextTableCloseDisabled |= DC_OPENED;
        disableWikitextTable(context, DC_WIKITEXTTABLE);
        context->wikitextTableOpenDisabled &= ~DC_WIKITEXTTABLE;
    }
    popBlockContext(context, &context->wikitextTableOpenDisabled);
}
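The generated enable/disableWikitextTable helpers are not reproduced above, but the idea is presumably that they set or clear a "disable cause" bit in the mask of every token production of the affected type ('wikitextTable' in this case). The following is only a hand-written sketch of that idea; the DC_* values and the row/cell fields are hypothetical, and only the names already quoted from the generated code are real:

/* Sketch only, not the generated code. */
#define DC_OPENED        0x01u   /* disabled because no matching opener is currently open */
#define DC_WIKITEXTTABLE 0x02u   /* disabled because no wikitext table is currently open   */

struct MWLEXERCONTEXT_struct {
    /* One "disabled" bitmask per guarded token production; a production is
     * active only while its mask is zero, which is what the
     * {!CX->...Disabled}?=> predicates test. */
    unsigned int wikitextTableOpenDisabled;
    unsigned int wikitextTableCloseDisabled;
    unsigned int wikitextTableRowDisabled;   /* hypothetical: the '|-' token       */
    unsigned int wikitextTableCellDisabled;  /* hypothetical: the '|', '||' tokens */
    int          wikitextTableOpenNestingLevel;
    /* ... one group of fields per predicate described in the table ... */
};

/* Enable every production of type 'wikitextTable' by clearing the given
 * disable cause from its mask. */
static inline void enableWikitextTable(struct MWLEXERCONTEXT_struct *context,
                                       unsigned int cause)
{
    context->wikitextTableRowDisabled  &= ~cause;
    context->wikitextTableCellDisabled &= ~cause;
}

/* Disable them again by setting the cause. */
static inline void disableWikitextTable(struct MWLEXERCONTEXT_struct *context,
                                        unsigned int cause)
{
    context->wikitextTableRowDisabled  |= cause;
    context->wikitextTableCellDisabled |= cause;
}

In the real code the set of affected productions would of course be derived from the 'affects' entries in the predicate description table rather than hard-coded as here.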
=== Block Context
The block context is a generalization of the table context concept that I have mentioned in a previous email.
The option 'pushesBlockContext', visible in the above example, states that the opening token corresponding to a particular predicate will create a new "block context" and push it onto a stack. The corresponding end token will pop the stack and restore the previous block context.
When a particular block context is activated, the corresponding closing token is activated, as well as any tokens internal to that block context. E.g., the '<ul>' token will enable the '</ul>' and '<li>' tokens. Deactivating a block context will disable these tokens again.
So, pushing a new block context includes deactivating the current context and activating the new one; popping includes deactivating the current block context and activating the previous one, if any.
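To make the push/pop discipline concrete, here is a minimal self-contained sketch. It is not the generated code: the real pushBlockContext/popBlockContext shown earlier take the lexer context and a pointer to the opener's 'disabled' field, whereas the BlockKind/BlockStack names, the callback representation, and the depth limit below are assumptions made purely to show the activate/deactivate ordering.

#include <assert.h>

struct MWLEXERCONTEXT_struct;   /* the lexer context sketched above */

/* One kind of block (an html list, a wikitext table, ...): how to activate
 * and deactivate its closing token and the tokens internal to it. */
typedef struct BlockKind {
    void (*activate)(struct MWLEXERCONTEXT_struct *cx);
    void (*deactivate)(struct MWLEXERCONTEXT_struct *cx);
} BlockKind;

#define MAX_BLOCK_DEPTH 64      /* hypothetical nesting limit */

typedef struct BlockStack {
    const BlockKind *entries[MAX_BLOCK_DEPTH];
    int              depth;
} BlockStack;

/* Pushing deactivates the enclosing block context, if any, and activates
 * the new one. */
static void pushBlock(BlockStack *stack, const BlockKind *kind,
                      struct MWLEXERCONTEXT_struct *cx)
{
    assert(stack->depth < MAX_BLOCK_DEPTH);
    if (stack->depth > 0)
        stack->entries[stack->depth - 1]->deactivate(cx);
    stack->entries[stack->depth++] = kind;
    kind->activate(cx);
}

/* Popping deactivates the current block context and reactivates the
 * previous one, if any. */
static void popBlock(BlockStack *stack, struct MWLEXERCONTEXT_struct *cx)
{
    assert(stack->depth > 0);
    stack->entries[--stack->depth]->deactivate(cx);
    if (stack->depth > 0)
        stack->entries[stack->depth - 1]->activate(cx);
}

With this ordering, a block's closing token and internal tokens are only active while that block is the innermost one on the stack, which is exactly the restriction described next.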
As a consequence, the html tokens '<li>', '<td>', '<tr>' etc. and the wikitext table tokens '|-', '|', '||' etc. will be disabled, and the corresponding text strings rendered verbatim, if they occur outside the appropriate context. Also, block elements that push the block context must be closed with the correct end token. For example:
<ul> <li><div>text</li> </ul>
will be rendered as:
<ul><li><div>text</li> </ul> </div></li></ul>
Since the <div> is still open, the end tokens for the list will not be active. The open elements are instead closed at the end of the article.
By restricting the block contexts in this way, lots of problems are avoided in the parser. Since the behavior doesn't seem to be well defined in the current parser, I would guess that this would be acceptable.
To summarize this rather long post, the lexer is the most complex part, but by letting it do a thorough filtering of the input, the parser's job becomes much simpler.
Best regards,
Andreas Jonsson
On 22 August 2010 16:09, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
The parser I am writing follows a four layered design:
... I feel like I just looked Cthulhu in the eye.
If you survive this, you'll deserve a Wikipedia holiday declared in your name.
- d.
2010-08-22 17:34, David Gerard wrote:
On 22 August 2010 16:09, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
The parser I am writing follows a four layered design:
... I feel like I just looked Cthulhu in the eye.
:-)
The complexity can't really be helped, though. I can't find any simpler way of structuring the parser.
If you survive this, you'll deserve a Wikipedia holiday declared in your name.
I have already implemented the lexer the way I described it. It works fine.
I did some profiling and it's still the parser that is the slowest component, not the lexer. I find that a bit surprising myself.
The current status is that it supports all html listed in Sanitizer.php (except <pre>, <img>, and <hr>), as well as lists, tables, headings, table of contents, indented text, apostrophe formatting, nowiki, horizontal rules, and internal links (except that the link title isn't validated, and the trail and prefix aren't implemented).
As far as I can see, the major missing things are image/media links and external links. Also, the indented text shouldn't be considered as such if it contains a block html element. Thus, I must introduce another lookahead for this.
I think I have solved the really hard parts. Now it's just a matter of polishing the details. I'll try to upload the whole thing to Wikimedia's subversion repository sometime in the next few days.
/Andreas
I think I have solved the really hard parts. Now it's just a matter of polishing the details. I'll try to upload the whole thing to Wikimedia's subversion repository sometime in the next few days.
How do you handle templates (interpretation)?
Thanks, Dirk
2010-08-24 01:19, Dirk Riehle wrote:
I think I have solved the really hard parts. Now it's just a matter of polishing the details. I'll try to upload the whole thing to Wikimedia's subversion repository sometime in the next few days.
How do you handle templates (interpretation)?
I am not doing any preprocessing at all. My goal with this parser is to have something that can be plugged in after preprocessing. Writing a preprocessor would be a separate project.
/Andreas