In my previous post I covered the lexer. Here I will describe the parser, the parser context and the listener interface. Since the lexer does the extensive work of providing a reasonably well-formed token stream, the parser's job becomes completely straightforward.
== The parser
For inlined elements, the parser just mindlessly reports these to the context object:
inline_element: word|space|special|br|html_entity|link_element|format|nowiki|table_of_contents|html_inline_tag
;
space: token = SPACE_TAB {IE(CX->onSpace(CX, $token->getText($token));)} ;
etc.
The lexer guarantees that a closing token will not appear before a corresponding opening token, and the parser context takes care of nesting formats and removing empty format tags.
For block elements, the only special thing the parser needs to pay attention to is the fact that end tokens may be missing. Therefore, end-of-file is always accepted in place of the closing token, for instance:
html_div: token = HTML_DIV_OPEN
        { CX->beginHtmlDiv(CX, $token->custom); }
    block_element_contents
    (HTML_DIV_CLOSE|EOF)
        { CX->endHtmlDiv(CX); }
    ;
The rule 'block_element_contents' covers all parser productions. The lexer restricts which tokens may appear. For instance, 'HTML_DIV_CLOSE' will never appear before a corresponding 'HTML_DIV_OPEN'. Also, list items and table cells will not appear unless the current block context is correct. I have also introduced a maximum nesting level in the lexer, so stack space is not an issue either.
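As a rough illustration of the two lexer guarantees mentioned above (no close before a matching open, and a bounded nesting depth), here is a minimal sketch in C. The names BlockState, MAX_NESTING, try_open_div and try_close_div are made up for this example and are not taken from the actual lexer; the idea that a rejected token would simply be treated as ordinary text is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit, kept tiny so the behaviour is easy to see. */
enum { MAX_NESTING = 3 };

/* Hypothetical lexer-side state: how many divs are currently open. */
typedef struct {
    int depth;
} BlockState;

/* Returns true if an open token may be emitted. When the limit is
 * reached the caller would not emit the token (assumption: the input
 * is then handled as plain text instead). */
bool try_open_div(BlockState *s)
{
    assert(s != NULL);
    if (s->depth >= MAX_NESTING)
        return false;
    s->depth++;
    return true;
}

/* Returns true if a close token may be emitted, i.e. only when there
 * is a corresponding open token that has not been closed yet. */
bool try_close_div(BlockState *s)
{
    assert(s != NULL);
    if (s->depth == 0)
        return false;
    s->depth--;
    return true;
}
```

With a check like this in the lexer, the recursion depth of 'block_element_contents' in the parser can never exceed the limit, which is why the parser rule itself can stay so simple.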
== The parser context
The parser context relays the parser events to a listener, but it will insert and remove events as needed to produce well-formed output. For instance:
text '' italic <b><strong /> bold-italic bold </b> text
will result in an event stream to the listener that will look like this:
text <i> italic <b> bold-italic </b></i> <b> bold </b> text
Two mechanisms are used to implement this:
* The call to the "begin" method is delayed until some actual inlined content is produced. The call is never made if an "end" event is received before such content.
* The order of the formats is maintained so that inner formats can be closed and reopened when a non-matching end token is received.
So, most of the parser context's methods look like this:
static void beginHtmlStrong(MWPARSERCONTEXT *context, pANTLR3_VECTOR attr)
{
    MW_DELAYED_CALL(context, beginHtmlStrong, endHtmlStrong, attr, NULL);
    MW_BEGIN_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong, attr, NULL, false);
    MWLISTENER *l = &context->listener;
    l->beginHtmlStrong(l, attr);
}

static void endHtmlStrong(MWPARSERCONTEXT *context)
{
    MW_SKIP_IF_EMPTY(context, beginHtmlStrong, endHtmlStrong, NULL);
    MW_END_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong, NULL);
    MWLISTENER *l = &context->listener;
    l->endHtmlStrong(l);
}
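To make the two mechanisms (delayed "begin" and ordered formats) more concrete, here is a small self-contained model in C. It is only a sketch of the idea and not the code behind the MW_* macros: the types Context and OpenFormat, the functions begin_format, end_format and on_text, and the string buffer that stands in for the listener are all invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { MAX_FORMATS = 8, OUT_SIZE = 256 };

typedef enum { FMT_ITALIC, FMT_BOLD } Format;

typedef struct {
    Format format;
    bool emitted;   /* has the "begin" event been sent yet? */
} OpenFormat;

typedef struct {
    OpenFormat open[MAX_FORMATS]; /* open formats, outermost first */
    size_t count;
    char out[OUT_SIZE];           /* stands in for the listener */
} Context;

static void emit(Context *cx, const char *s)
{
    assert(strlen(cx->out) + strlen(s) < OUT_SIZE);
    strcat(cx->out, s);
}

static void emit_tag(Context *cx, const char *prefix, Format f)
{
    emit(cx, prefix);
    emit(cx, f == FMT_ITALIC ? "i" : "b");
    emit(cx, ">");
}

void context_init(Context *cx)
{
    cx->count = 0;
    cx->out[0] = '\0';
}

/* Mechanism 1: only record the format; the "begin" event is delayed. */
void begin_format(Context *cx, Format f)
{
    assert(cx->count < MAX_FORMATS);
    cx->open[cx->count].format = f;
    cx->open[cx->count].emitted = false;
    cx->count++;
}

/* Actual content arrives: send all pending "begin" events, outermost
 * first, and then the content itself. */
void on_text(Context *cx, const char *text)
{
    for (size_t i = 0; i < cx->count; i++) {
        if (!cx->open[i].emitted) {
            emit_tag(cx, "<", cx->open[i].format);
            cx->open[i].emitted = true;
        }
    }
    emit(cx, text);
}

/* Mechanism 2: close inner formats first, then the requested one.
 * The inner formats stay on the stack as pending, so they are
 * reopened when the next content arrives. */
void end_format(Context *cx, Format f)
{
    size_t pos = cx->count;
    while (pos > 0 && cx->open[pos - 1].format != f)
        pos--;
    if (pos == 0)
        return;             /* no matching begin: ignore the end */
    pos--;                  /* index of the matching entry */

    for (size_t i = cx->count; i-- > pos; ) {
        if (cx->open[i].emitted) {
            emit_tag(cx, "</", cx->open[i].format);
            cx->open[i].emitted = false;
        }
    }
    /* Drop the matching entry; entries above it move down one slot. */
    memmove(&cx->open[pos], &cx->open[pos + 1],
            (cx->count - pos - 1) * sizeof cx->open[0]);
    cx->count--;
}
```

Feeding this model the events text, begin italic, text, begin bold, text, end italic, text, end bold, text (with the texts "text ", "italic ", "bold-italic ", "bold " and "text") leaves "text <i>italic <b>bold-italic </b></i><b>bold </b>text" in the buffer. A begin that is immediately followed by its end produces no output at all, because the begin event was never sent.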
Block elements are already guaranteed by the lexer to be well nested, so the context typically does not need to do anything special about those. Only the wikitext list elements need to be resolved by the context.
== The listener
The listening application needs to implement the MWLISTENER interface. I haven't added support for all features yet, but at the moment, there are 91 methods in this interface. They are trivial to implement, though. The only thing to think about is that it is the listener's responsibility to escape the contents of nowiki and special characters, and also to filter the attribute lists.
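As an example of what the escaping responsibility could look like on the listener side, here is a sketch of a single callback in C. DemoListener, demo_on_nowiki and demo_listener_init are hypothetical names; the real MWLISTENER has many more methods and its exact signatures are not shown in this post, so the struct below only imitates the "function pointer taking the listener as first argument" style seen in the context code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for a listener. */
typedef struct DemoListener DemoListener;
struct DemoListener {
    void (*onNowiki)(DemoListener *l, const char *text);
    char out[128];          /* collected output */
};

static void append(DemoListener *l, const char *s)
{
    assert(strlen(l->out) + strlen(s) < sizeof l->out);
    strcat(l->out, s);
}

/* The parser hands over raw text; the listener escapes the characters
 * that are significant in HTML before writing them out. */
static void demo_on_nowiki(DemoListener *l, const char *text)
{
    for (const char *p = text; *p != '\0'; p++) {
        switch (*p) {
        case '&': append(l, "&amp;");  break;
        case '<': append(l, "&lt;");   break;
        case '>': append(l, "&gt;");   break;
        case '"': append(l, "&quot;"); break;
        default: {
            char one[2] = { *p, '\0' };
            append(l, one);
        }
        }
    }
}

void demo_listener_init(DemoListener *l)
{
    l->onNowiki = demo_on_nowiki;
    l->out[0] = '\0';
}
```

Filtering of attribute lists would follow the same pattern: the context passes the attributes through unchanged and the listener decides which ones are allowed in the output.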
/Andreas
"Andreas Jonsson" andreas.jonsson@kreablo.se wrote in message news:4C72738C.80704@kreablo.se...
- The call to the "begin" method is delayed until some actual inlined content is produced. The call is never made if an "end" event is received before such content.
Does this mean that constructs such as <span id="JSPlaceholder"></span> are obliterated by the lexer? Some empty inline (and block) elements may have an important purpose as a JS DOM hook, and should not be removed from the output stream.
- Mark Clements (HappyDog)
2010-09-02 15:15, Mark Clements (HappyDog) wrote:
Does this mean that constructs such as <span id="JSPlaceholder"></span> are obliterated by the lexer? Some empty inline (and block) elements may have an important purpose as a JS DOM hook, and should not be removed from the output stream.
Yes, that is correct. This is what the original parser does for <i> and <b>. But now that you mention it, I realize that this is probably just an artefact of cleaning up the apostrophe mess.
I changed it so that inlined empty html elements are always included.
/Andreas
"Andreas Jonsson" andreas.jonsson@kreablo.se wrote in message news:4C7FD17A.7000906@kreablo.se...
Yes, that is correct. This is what the original parser does for <i> and <b>. But now that you mention it, I realize that this is probably just an artefact of cleaning up the apostrophe mess.
I changed it so that inlined empty html elements are always included.
That sounds sensible. Any HTML inserted manually should be left in place (possibly tidied - e.g. addition of closing tags - but not removed). It's only the generated HTML that should (arguably) be cleaned up in this way. If the user doesn't want the empty tag, then they can edit the page to remove it.
- Mark Clements (HappyDog).
wikitext-l@lists.wikimedia.org