Parsing of image links - Wikitext-l

9 Sep 2010

The syntax of image links with caption is seriously flawed, but I
think that I have found a reasonable solution for handling them: parse
them as "inline blocks".

To make an inline block out of the image link with caption, we first
let it have its own block context in the lexer, in order to guarantee
the nesting order of internal block elements.  This means that the end
token cannot appear in the wrong block context:

   [[File:example.jpg|<table><td> this ]] is not an end token
   for the image link</table> but this ]] is
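
A minimal sketch of the block context idea, in Python.  The token set
and the function name find_image_link_ends are made up for
illustration and are not taken from the actual lexer; only the rule
itself (a "]]" closes the image link only when the image link is the
innermost open block context) comes from the description above.

```python
import re

# Hypothetical, heavily simplified token set: image link open,
# table open/close, and link close.
TOKEN_RE = re.compile(r"\[\[File:|<table>|</table>|\]\]")

def find_image_link_ends(text):
    """Return the offsets of every ']]' that closes an image link."""
    stack = []   # open block contexts, innermost last
    ends = []
    for m in TOKEN_RE.finditer(text):
        tok = m.group()
        if tok == "[[File:":
            stack.append("image")
        elif tok == "<table>":
            stack.append("table")
        elif tok == "</table>":
            if stack and stack[-1] == "table":
                stack.pop()
        elif tok == "]]":
            # Only an end token when the image link is the innermost
            # open context; otherwise it is left as plain text.
            if stack and stack[-1] == "image":
                stack.pop()
                ends.append(m.start())
    return ends

sample = ("[[File:example.jpg|<table><td> this ]] is not an end token "
          "for the image link</table> but this ]] is")
ends = find_image_link_ends(sample)
```

On the sample input only the last "]]" is reported, since the first
one is seen while the table is still the innermost context.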

I have already discussed image links in the context of speculative
execution in the lexer, which is used to guarantee that any opened
image link will be followed by an image link closing token.  The
maximum nesting level for links is limited to 2 to avoid pathological
speculation.
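
The speculation with a nesting limit could look roughly like the
Python sketch below.  The function name speculate_link_close and the
choice to simply give up (return None) when the limit is exceeded are
assumptions for illustration; the only thing taken from the text is
the limit of 2.

```python
MAX_LINK_NESTING = 2  # limit stated in the text

def speculate_link_close(text, start):
    """Scan ahead from the '[[' at `start`.

    Returns the offset just past the matching ']]', or None if the
    link is never closed or nests deeper than MAX_LINK_NESTING.
    """
    if not text.startswith("[[", start):
        raise ValueError("start must point at '[['")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("[[", i):
            depth += 1
            if depth > MAX_LINK_NESTING:
                return None  # stop speculating on pathological input
            i += 2
        elif text.startswith("]]", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return None  # no closing token: treat the opening as plain text
```

A caller would only emit the image link opening token when this
returns an offset, and otherwise fall back to emitting plain text.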

In the parser, inline blocks may appear in inlined text lines.  For
the purposes of apostrophe parsing, however, they split the inlined
text line in two.  Since block elements may appear in the
image caption, they cannot be part of the lookahead that is performed
for scanning for apostrophes.  This means that in this example:

   text '' italic [[File:example.jpg| text ]] foo '' bar

the text "text '' italic" and the text " foo '' bar" are processed
separately when it comes to apostrophe parsing and the result will be:

<p>text <i> italic</i><a ...><img ..></a>foo <i>
bar </i></p>

This differs from the current parser, which produces:

<p>text <i> italic<a ...><img ..></a>foo </i>
bar</p>

However, the behavior will be the same regardless of newlines in the
caption:

   text '' italic [[File:example.jpg| text
   text ]] foo '' bar

still:

<p>text <i> italic</i><a ...><img ..></a>foo <i>
bar </i></p>

The original parser has problems with this input:

<p>text <i> italic<a ...><img ..></a>foo  bar
</i></i></p>

(My guess is that it first renders the </i> inside of the alt
attribute, which is cleaned up in the attribute sanitizing, and then
it discovers that there is a missing </i> and adds that in.)
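
The split apostrophe handling described above can be sketched in
Python as follows.  This is only an illustration: the InlineBlock
class and both function names are hypothetical, only the two-apostrophe
italic case is handled, and whitespace is not trimmed the way the
sample output above shows.

```python
class InlineBlock:
    """Stand-in for an already parsed image link with caption."""
    def __init__(self, html):
        self.html = html

def render_italics(segment):
    """Pair up '' within one segment only.

    An italic that is still open at the end of the segment is
    closed there, so it never spans an inline block.
    """
    parts = segment.split("''")
    out = []
    for i, part in enumerate(parts):
        if i > 0:
            # Odd-numbered separators open, even-numbered ones close.
            out.append("<i>" if i % 2 == 1 else "</i>")
        out.append(part)
    if len(parts) % 2 == 0:   # odd number of '' tokens: still open
        out.append("</i>")
    return "".join(out)

def render_line(items):
    """Render a line given as text segments separated by inline blocks."""
    out = []
    for item in items:
        if isinstance(item, InlineBlock):
            out.append(item.html)            # not part of the lookahead
        else:
            out.append(render_italics(item)) # each segment on its own
    return "<p>" + "".join(out) + "</p>"

line = ["text '' italic ", InlineBlock("<a><img></a>"), " foo '' bar"]
html = render_line(line)
```

Because the caption is already hidden inside the InlineBlock item,
newlines in the caption cannot influence the result.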

In the original parser, wikitext list elements cannot appear in image
captions.  It would, of course, be very easy to just disable the
wikitext list tokens in the lexer to provide the same behavior, but
this seems a bit inconsistent as any other block element may appear in
the caption.  If we instead, in the parser, push the current list
context onto a stack when entering an "inline block" and pop it when
leaving, we can support lists inside the caption with the expected
behavior in this case:

* list [[File:example.jpg|
* list item in image caption ]]
* continuing outer list
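
A rough Python sketch of the push/pop idea, limited to single-level
lists.  The class name, method names and event strings are invented
for illustration and do not correspond to the real parser.

```python
class ListContextParser:
    def __init__(self):
        self.in_list = False  # is a list open in the current context?
        self.saved = []       # stack of suspended outer list contexts
        self.events = []

    def list_item(self, text):
        if not self.in_list:
            self.events.append("list-start")
            self.in_list = True
        self.events.append("item: " + text)

    def _close_list(self):
        if self.in_list:
            self.events.append("list-end")
            self.in_list = False

    def enter_inline_block(self):
        self.saved.append(self.in_list)  # push the outer list context
        self.in_list = False             # the caption starts fresh
        self.events.append("caption-start")

    def leave_inline_block(self):
        self._close_list()               # close lists opened in caption
        self.events.append("caption-end")
        self.in_list = self.saved.pop()  # resume the outer list

    def finish(self):
        self._close_list()
        return self.events

p = ListContextParser()
p.list_item("list")
p.enter_inline_block()
p.list_item("list item in image caption")
p.leave_inline_block()
p.list_item("continuing outer list")
events = p.finish()
```

With the input above, the item in the caption gets its own list, and
"continuing outer list" becomes the second item of the outer list
rather than starting a new one.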

It is up to the listener to decide what to do with the link caption.
Since it is fully parsed, the listening application must be prepared
for this.  In HTML output, the caption is rendered inside the 'alt'
attribute, unless there is a 'frame' or 'thumb' option and no explicit
'alt' option (in which case the caption is completely ignored).  So
the listener should have the ability to toggle rendering of markup on
and off in order to render the caption inside the alt attribute.
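
One way such a toggle could look, as a Python sketch.  The listener
interface (begin_image, end_image, begin_italic, and so on) is
hypothetical, and only the plain 'alt' case is covered, not the
'frame' or 'thumb' options or nested image links.

```python
from html import escape

class HtmlListener:
    def __init__(self):
        self.out = []
        self.markup_enabled = True

    def text(self, s):
        # Text is always emitted, escaped so it is safe in attributes.
        self.out.append(escape(s, quote=True))

    def begin_italic(self):
        if self.markup_enabled:
            self.out.append("<i>")

    def end_italic(self):
        if self.markup_enabled:
            self.out.append("</i>")

    def begin_image(self, src):
        self.out.append('<img src="%s" alt="' % escape(src, quote=True))
        # Caption events now land inside alt="", so suppress tags.
        self.markup_enabled = False

    def end_image(self):
        self.out.append('">')
        self.markup_enabled = True

    def result(self):
        return "".join(self.out)

listener = HtmlListener()
listener.begin_image("example.jpg")
listener.text("a ")
listener.begin_italic()
listener.text("b")
listener.end_italic()
listener.end_image()
rendered = listener.result()
```

The italic events inside the caption are silently dropped, so only
the caption text ends up in the attribute.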

/Andreas