Wikitext-l January 2011

wikitext-l@lists.wikimedia.org

3 participants
5 discussions

A specialized lexer generator for wikitext
by Andreas Jonsson 26 Jan '11

26 Jan '11

(Starting a paragraph with the words "From my" for some reason seems to be a problem for my email client. Sorry about that.) Hi, My experiments with an Antlr based parser have led me to believe it is possible to write a formal parser for Mediawiki syntax that is sufficiently compatible with the original to be used as a replacement. But it is, however, questionable whether using Antlr is the best option for implementing such a parser. It has several drawbacks: It is hard to follow the code in the lexical analyser. The algorithm used by Antlr doesn't match the requirements for wikitext very well. I would guess that my implementation contains many errors where the lexer will fail to accept the input. Since Antlr is not designed for languages where everything is valid input, it is very hard to find these errors by inspecting the code. Also an an external tool for generating semantic predicates is necessary so the description of the lexical rules are divided into two separate parts. Furthermore, for a recursive descendent algorithm is not the fastest options when it comes to parsing. If the token stream is sufficiently refined a LALR parser can be used instead. I believe that it is always possible to transform a wikitext to a token stream that can be easily parsed. Thus, I would propose writing a specialized lexer generator that is suitable for generating a lexer that does precisely this. As a consequence, the wikitext specification would be more compact; it would be easier to maintain for multiple target languages and the resulting parser would be smaller and faster than an Antlr generated one. The grammar over the resulting token stream will be context free and parsable with any standard parsing algorithm. Here are my preliminary thoughts on the requirement of the lexer generator: 1. The input to the lexer generator is a table of tokens where the properties of each token is described. 2. The output from the lexer generator is a sequence of tokens. Each token may be associated with a string and a vector of attributes. 3. Suitable pattern matching algorithm. There should be four basic classes of characters: 1. New line characters 2. Special characters (that needs escaping) 3. Ordinary characters 4. Ignored characters If no token production rule matches, the characters should just be lumped together into tokens of longest sequence of characters of the same class. A bonus objective is to make these character classes runtime configurable, since the set of characters that need escaping differs for different applications. A token production rule consists of a simple regular pattern that is used for preliminary matching of a candidate production rule. The matched text can in a second step be matched by a PCRE or otherwise processed to finally decide whether to accept the production or not. 4. Toggling token productions. The context is effectively defined by the set of enabled token productions, which means there are a very large number of different contexts. Therefore we need a more dynamic context mechanism than what are supported by tools like flex and javacc. We need a mechanism where individual token productions can be dynamically disabled and enabled. The code for doing this will be inferred from the properties of the tokens and generated. 5. Late token determination. The semantic meaning of a sequence of apostrophes cannot be resolved until the end of a text line. The most efficient method to handle this special case is to produce a set of pseudo tokens that can be resolved into specific tokens when reaching the end of line. No lookahead would be required for this. In my previous implementation I solved apostrophe parsing by parser lookahead, which is costly. This mechanism alone seemed to account for about 30% of the execution time of the parser. Furthermore, a token representing the start of an internal link can be further resolved into a "red link" or "blue link" at the end of the lexical analysis. 6. Speculative execution. Lookahead is problematic for context sensitive analysis. Instead we use speculative execution. Thus, there must be a mechanism to save and restore the full state of the analysis, as well as the state of input and output streams. 7. Simple action language Only a few operations needs to be supported in the actions. To simplify code generation for multiple languages, it is a good idea to define a simple language for the actions that is independent of target language to avoid having to maintain the actions per target language. 8. Multilingual code generation It is desirable to support at least PHP, Javascript and C. The lexer generator will generate code that assumes that a runtime environment that implements the following APIs exist in the target languages: // Input public interface InputStream { InputStreamMark mark(); void rewind(InputStreamMark mark); void advance(int characters); String getText(InputStreamMark mark); int getLine(); int getColumn(); } // Token management enum TokenType { ... } public interface TokenFactory { <T extends Token> T newToken(TokenType tokenType); } public interface Token { void setText(String text); String getText(); void setPosition(int line, int column); } public interface AttributeToken extends Token { void setAttributes(AttributeList attributeList); } public interface HeadingToken extends AttributeToken { void setLevel(int level); } public interface LinkToken extends Token { void setLinkTarget(String linkTarget); } public interface InternalLinkToken extends LinkToken { void setLinkType(LinkType linkType); void setTargetURL(String url); } public interface TagToken extends AttributeToken { void setTagName(String name); } // Output token stream public interface TokenStream { void put(Token token); TokenStreamMark mark(); Token takeBack(); void discard(TokenStreamMark mark); } // Application interaction public interface Application { boolean isValidLink(String text); boolean isValidURL(String text); boolean linkPrefixesEnabled(); Set<String> getURLProtocols(); Set<String> getBlockTags(); Set<String> getInlineTags(); void resolveLinks(Set<InternalLinkToken> linkTokens); } Best regards, Andreas Jonsson

2 1

A specialized lexical analysator generator for wikitext
by Andreas Jonsson 26 Jan '11

26 Jan '11

Hi, (Sorry for the multiple posts, but I seem to have a problem with my mail client.) >From my experiments with an Antlr based parser, I am personally convinced that it is possible to write a formal parser for Mediawiki syntax that is sufficiently compatible with the original to be used as a replacement. But it is, however, questionable whether using Antlr is the best option for implementing such a parser. It has several drawbacks: It is hard to follow the code in the lexical analyser. The algorithm used by Antlr doesn't match the requirements for wikitext very well. I would guess that my implementation contains many errors where the lexer will fail to accept the input. Since Antlr is not designed for languages where everything is valid input, it is very hard to find these errors by inspecting the code. Also an an external tool for generating semantic predicates is necessary so the description of the lexical rules are divided into two separate parts. Furthermore, for a recursive descendent algorithm is not the fastest options when it comes to parsing. If the token stream is sufficiently refined a LALR parser can be used instead. I believe that it is always possible to transform a wikitext to a token stream that can be easily parsed. Thus, I would propose writing a specialized lexical analyser generator that is suitable for generating a lexer that does precisely this. As a consequence, the wikitext specification would be more compact; it would be easier to maintain for multiple target languages and the resulting parser would be smaller and faster than an Antlr generated one. The grammar over the resulting token stream will be context free and parsable with any standard parsing algorithm. Here are my preliminary thoughts on the requirement of the lexer generator: 1. The input to the lexer generator is a table of tokens where the properties of each token is described. 2. The output from the lexer generator is a sequence of tokens. Each token may be associated with a string and a vector of attributes. 3. Suitable pattern matching algorithm. There should be four basic classes of characters: 1. New line characters 2. Special characters (that needs escaping) 3. Ordinary characters 4. Ignored characters If no token production rule matches, the characters should just be lumped together into tokens of longest sequence of characters of the same class. A bonus objective is to make these character classes runtime configurable, since the set of special characters differs for different applications. A token production rule consists of a simple regular pattern that is used for preliminary matching of a candidate production rule. The matched text can in a second step be matched by a PCRE or otherwise processed to finally decide whether to accept the production or not. 4. Toggling token productions depending on context. The context is effectively defined by the set of enabled token productions, which means there are a very large number of different contexts. Therefore we need a more dynamic context mechanism than what are supported by tools like flex and javacc. We need a mechanism where individual token productions can be dynamically disabled and enabled. 5. Late token determination. The semantic meaning of a sequence of apostrophes cannot be resolved until the end of a text line. The most efficient method to handle this special case is to produce a set of pseudo tokens that can be resolved into specific tokens when reaching the end of line. No lookahead would be required for this. In my previous implementation I solved apostrophe parsing by parser lookahead, which is costly. This mechanism alone seemed to account for about 30% of the execution time of the parser. Furthermore, a token representing the start of an internal link can be further resolved into a "red link" or "blue link" at the end of the lexical analysis. 6. Speculative execution. Lookahead is problematic for context sensitive analysis. Instead we use speculative execution. Thus, there must be a mechanism to save and restore the full state of the analysis, as well as the state of input and output streams. 7. Simple action language Only a few operations needs to be supported in the actions. To simplify code generation for multiple languages, it is a good idea to define a simple language for the actions that is independent of target language to avoid having to maintain the actions per target language. 8. Multilingual code generation It is desirable to support at least PHP, Javascript and C. The lexer generator will generate code that assumes that a runtime environment that implements the following APIs exist in the target languages: // Input public interface InputStream { InputStreamMark mark(); void rewind(InputStreamMark mark); void advance(int characters); String getText(InputStreamMark mark); int getLine(); int getColumn(); } // Token management enum TokenType { ... } public interface TokenFactory { <T extends Token> T newToken(TokenType tokenType); } public interface Token { void setText(String text); String getText(); void setPosition(int line, int column); } public interface AttributeToken extends Token { void setAttributes(AttributeList attributeList); } public interface HeadingToken extends AttributeToken { void setLevel(int level); } public interface LinkToken extends Token { void setLinkTarget(String linkTarget); } public interface InternalLinkToken extends LinkToken { void setLinkType(LinkType linkType); void setTargetURL(String url); } public interface TagToken extends AttributeToken { void setTagName(String name); } // Output token stream public interface TokenStream { void put(Token token); TokenStreamMark mark(); Token takeBack(); void discard(TokenStreamMark mark); } // Application interaction public interface Application { boolean isValidLink(String text); boolean isValidURL(String text); boolean linkPrefixesEnabled(); Set<String> getURLProtocols(); Set<String> getBlockTags(); Set<String> getInlineTags(); void resolveLinks(Set<InternalLinkToken> linkTokens); }

1 0

A specialized lexer generator for wikitext
by Andreas Jonsson 26 Jan '11

26 Jan '11

Hi, >From my experiments with an Antlr based parser, I am personally convinced that it is possible to write a formal parser for Mediawiki syntax that is sufficiently compatible with the original to be used as a replacement. But it is, however, questionable whether using Antlr is the best option for implementing such a parser. It has several drawbacks: It is hard to follow the code in the lexical analyser. The algorithm used by Antlr doesn't match the requirements for wikitext very well. I would guess that my implementation contains many errors where the lexer will fail to accept the input. Since Antlr is not designed for languages where everything is valid input, it is very hard to find these errors by inspecting the code. Also an an external tool for generating semantic predicates is necessary so the description of the lexical rules are divided into two separate parts. Furthermore, for a recursive descendent algorithm is not the fastest options when it comes to parsing. If the token stream is sufficiently refined a LALR parser can be used instead. I believe that it is always possible to transform a wikitext to a token stream that can be easily parsed. Thus, I would propose writing a specialized lexer generator that is suitable for generating a lexer that does precisely this. As a consequence, the wikitext specification would be more compact; it would be easier to maintain for multiple target languages and the resulting parser would be smaller and faster than an Antlr generated one. The grammar over the resulting token stream will be context free and parsable with any standard parsing algorithm. Here are my preliminary thoughts on the requirement of the lexer generator: 1. The input to the lexer generator is a table of tokens where the properties of each token is described. 2. The output from the lexer generator is a sequence of tokens. Each token may be associated with a string and a vector of attributes. 3. Suitable pattern matching algorithm. There should be four basic classes of characters: 1. New line characters 2. Special characters (that needs escaping) 3. Ordinary characters 4. Ignored characters If no token production rule matches, the characters should just be lumped together into tokens of longest sequence of characters of the same class. A bonus objective is to make these character classes runtime configurable, since the set of characters that need escaping differs for different applications. A token production rule consists of a simple regular pattern that is used for preliminary matching of a candidate production rule. The matched text can in a second step be matched by a PCRE or otherwise processed to finally decide whether to accept the production or not. 4. Toggling token productions. The context is effectively defined by the set of enabled token productions, which means there are a very large number of different contexts. Therefore we need a more dynamic context mechanism than what are supported by tools like flex and javacc. We need a mechanism where individual token productions can be dynamically disabled and enabled. The code for doing this will be inferred from the properties of the tokens and generated. 5. Late token determination. The semantic meaning of a sequence of apostrophes cannot be resolved until the end of a text line. The most efficient method to handle this special case is to produce a set of pseudo tokens that can be resolved into specific tokens when reaching the end of line. No lookahead would be required for this. In my previous implementation I solved apostrophe parsing by parser lookahead, which is costly. This mechanism alone seemed to account for about 30% of the execution time of the parser. Furthermore, a token representing the start of an internal link can be further resolved into a "red link" or "blue link" at the end of the lexical analysis. 6. Speculative execution. Lookahead is problematic for context sensitive analysis. Instead we use speculative execution. Thus, there must be a mechanism to save and restore the full state of the analysis, as well as the state of input and output streams. 7. Simple action language Only a few operations needs to be supported in the actions. To simplify code generation for multiple languages, it is a good idea to define a simple language for the actions that is independent of target language to avoid having to maintain the actions per target language. 8. Multilingual code generation It is desirable to support at least PHP, Javascript and C. The lexer generator will generate code that assumes that a runtime environment that implements the following APIs exist in the target languages: // Input public interface InputStream { InputStreamMark mark(); void rewind(InputStreamMark mark); void advance(int characters); String getText(InputStreamMark mark); int getLine(); int getColumn(); } // Token management enum TokenType { ... } public interface TokenFactory { <T extends Token> T newToken(TokenType tokenType); } public interface Token { void setText(String text); String getText(); void setPosition(int line, int column); } public interface AttributeToken extends Token { void setAttributes(AttributeList attributeList); } public interface HeadingToken extends AttributeToken { void setLevel(int level); } public interface LinkToken extends Token { void setLinkTarget(String linkTarget); } public interface InternalLinkToken extends LinkToken { void setLinkType(LinkType linkType); void setTargetURL(String url); } public interface TagToken extends AttributeToken { void setTagName(String name); } // Output token stream public interface TokenStream { void put(Token token); TokenStreamMark mark(); Token takeBack(); void discard(TokenStreamMark mark); } // Application interaction public interface Application { boolean isValidLink(String text); boolean isValidURL(String text); boolean linkPrefixesEnabled(); Set<String> getURLProtocols(); Set<String> getBlockTags(); Set<String> getInlineTags(); void resolveLinks(Set<InternalLinkToken> linkTokens); } Best regards, Andreas Jonsson

1 0

A specialized lexer generator for wikitext
by Andreas Jonsson 26 Jan '11

26 Jan '11

1 0

Wikitext Madness of the Day: Internal Links
by Helder 08 Jan '11

08 Jan '11

[Wikitext-l] Wikitext Madness of the Day: Internal Links Andreas Jonsson andreas.jonsson at kreablo.se Thu Aug 12 07:47:51 UTC 2010 2010-08-12 09:40, Daniel Kinzler skrev: > Tomaž Šolc schrieb: > >> Hi >> >> >>> (The actual regexp is using (.+?), but using an empty string as link >>> text will still produce a link for some strange reason.) >>> >> Empty string is actually a special case: such links will use the name of >> the target page minus the namespace prefix for the anchor. >> > but that's done on save, i guess before the actual parsing. > > Pipe trick syntax: [[User:Foo|]] will turn into [[User:Foo|Foo]] upon save. > > OK, that explains it. > Yay arcane special cases. > > Yay, utomatically fixing the user input. That'll make things a lot easier. :-) /Andreas But see also bug 5004 ("Keep the pipe trick ([[Piped link|]]) syntax in wikitext instead of converting it at pre-save transform"): https://bugzilla.wikimedia.org/show_bug.cgi?id=5004 Helder

1 0

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

Wikitext-l January 2011