I have previously stated that the most complex thing in the MediaWiki wikitext seemed to be the apostrophes. I was wrong.
Links are divided into internal and external. Here I will write about internal links. Internal links can further be subdivided into regular links and image links; image links complicate things even further, but I will not discuss them here.
In its most basic form a link has the pattern:
prefix '[[' (' '|'\t')* title=(LEGAL_TITLE_CHARS+) (' '|'\t')* ']]' trail
Whether the prefix is part of the link or not is configurable. The rendered html is
<a href="<url>&title=$title"> $prefix$title$trail </a>
The only slight problem here for implementing a parser is that LEGAL_TITLE_CHARS and the prefix support are configurable.
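To illustrate, a rough sketch in Python (the character class here is just a stand-in for the configurable LEGAL_TITLE_CHARS, and prefix/trail handling is left out):

    import re

    # Simplified stand-in for the configurable LEGAL_TITLE_CHARS class.
    LEGAL_TITLE = r"[A-Za-z0-9 _\-/():]"

    # '[[' (' '|'\t')* title (' '|'\t')* ']]'
    BASIC_LINK = re.compile(r"\[\[[ \t]*(" + LEGAL_TITLE + r"+?)[ \t]*\]\]")

    def render_basic(text):
        # Replace every basic internal link with an anchor (prefix/trail omitted).
        return BASIC_LINK.sub(
            lambda m: '<a href="<url>&title=%s">%s</a>' % (m.group(1), m.group(1)),
            text)

    print(render_basic("See [[Main Page]] for details."))
    # -> See <a href="<url>&title=Main Page">Main Page</a> for details.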
The situation becomes funny when considering the case with custom link text:
'[[' <link pattern> '|' (.*?) ']]'
(I have substituted the pattern (' '|'\t')* title=(LEGAL_TITLE_CHARS+) (' '|'\t')* with <link pattern> and removed the prefix and trail for clarity.) The .* is matched non-greedily, so it will be the first instance of ']]' following the opening '[[' that will be matched. (The actual regexp is using (.+?), but using an empty string as link text will still produce a link for some strange reason.)
Before the pattern matching, the whole article is split into pieces using the '[[' token as delimiter. A pattern match for all variants of internal link patterns (sans the '[[') is attempted at the beginning of each piece of text. The consequence is that the closing ']]' will be matched with the last opening '[[' preceding it.
If the pattern match fails, no link is produced; the wikitext is outputted and may be processed again to produce other constructions later on.
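A rough Python sketch of this split-and-match strategy (simplified: a stand-in title character class and only the basic and piped variants) shows why a ']]' always pairs with the last '[[' preceding it:

    import re

    # Simplified title character class; the real regexp uses the
    # configurable $tc class.
    LINK_AT_START = re.compile(r"^([A-Za-z0-9 _\-/():]+)(?:\|(.+?))?\]\]", re.S)

    def replace_links(text):
        pieces = text.split('[[')
        out = [pieces[0]]
        for piece in pieces[1:]:
            m = LINK_AT_START.match(piece)
            if m:
                title = m.group(1)
                label = m.group(2) if m.group(2) is not None else title
                out.append('<a href="<url>&title=%s">%s</a>%s'
                           % (title, label, piece[m.end():]))
            else:
                # No match: emit the '[[' literally and leave the rest as text.
                out.append('[[' + piece)
        return ''.join(out)

    print(replace_links("a [[x [[Link|text]] b"))
    # -> a [[x <a href="<url>&title=Link">text</a> b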
The fundamental problem with all this is the fact that when looking at the wikitext '[[Link|', it cannot easily be determined if this is actually a link, or if it is just the text '[[Link|'. It is not a link unless there is a closing ']]' token that is not preceded by another '[[' token. The closing token may appear almost anywhere:
This is not a link:
---------
[[Link|
text.
---------
But this is a link:
---------
[[Link|
text.]]
---------
And this:
---------
[[Link|
text
* list item]]
---------
The list element will not be rendered as a list. A table will, however, be rendered as a table:
--------
[[Link|
{|
| col1
|}
text]]
--------
Do you need to put a table inside a definition term? No problem:
--------
;[[l|
{|
| col1
|}]]
--------
This is an especially funny example:
---------
[[Link|
{|
| col1
|- id="row2" class=']]'
|}
--------
The rendered link will be left unterminated.
Links cannot be opened inside table parameters, though. This is not a link:
--------
{| id="[[Link|"
| text]]
|}
--------
Even though the table parameters usually swallow any junk:
--------
{| id="table1" class="[[Link]]"
| No link above.
|}
--------
Column parameters are different:
--------
{|
| class="[[Link]]" | <-- will not actually become column parameters
| class="[[Link" | <-- still not column parameters, but this time no link either
| class="[[Link" | but of course, if there happens to be a ']]' token somewhere later in the document, it's a whole different matter.
|}
--------
Trying to reproduce this behavior in a new parser would, of course, be insane. In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
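To illustrate, a minimal sketch (Python, piped links only, not MediaWiki's actual behaviour) of the simplest variant, where [[Title| always opens a link unless one is already open, and end of line implicitly closes it:

    import re

    OPEN = re.compile(r"\[\[([^|\[\]\n]+)\|")

    def render_line(line):
        out, pos, open_link = [], 0, False
        while pos < len(line):
            if not open_link:
                m = OPEN.search(line, pos)
                if not m:
                    out.append(line[pos:])
                    break
                out.append(line[pos:m.start()])
                out.append('<a href="<url>&title=%s">' % m.group(1))
                open_link = True
                pos = m.end()
            else:
                close = line.find(']]', pos)
                if close == -1:
                    out.append(line[pos:])
                    pos = len(line)
                else:
                    out.append(line[pos:close] + '</a>')
                    open_link = False
                    pos = close + 2
        if open_link:
            out.append('</a>')    # end of line implicitly closes the link
        return ''.join(out)

    print(render_line("a [[Foo|bar]] and an unterminated [[Baz|qux"))
    # -> a <a href="<url>&title=Foo">bar</a> and an unterminated
    #    <a href="<url>&title=Baz">qux</a>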
/Andreas
Andreas Jonsson wrote:
I have previously stated that the most complex thing in the MediaWiki wikitext seemed to be the apostrophes. I was wrong.
Links are divided into internal and external. Here I will write about internal links. Internal links can further be subdivided into regular links and image links; image links complicate things even further, but I will not discuss them here.
(...)
Before the pattern matching, the whole article is split into pieces using the '[[' token as delimiter. A pattern match for all variants of internal link patterns (sans the '[[') is attempted at the beginning of each piece of text. The consequence is that the closing ']]' will be matched with the last opening '[[' preceding it.
If the pattern match fails, no link is produced; the wikitext is outputted and may be processed again to produce other constructions later on.
The fundamental problem with all this is the fact that when looking at the wikitext '[[Link|', it cannot easily be determined if this is actually a link, or if it is just the text '[[Link|'. It is not a link unless there is a closing ']]' token that is not preceded by another '[[' token. The closing token may appear almost anywhere:
(funny links)
That's because, after preprocessing, tables are done before anything else. It wasn't like this originally, but after some vulnerabilities based on parsing after tables, they were moved there.
They cannot appear inside ids or classes because that would be an illegal id/class name. So it'll be moved into the parameter and then the Sanitizer strips them, I suppose.
Don't rely on these things. Consider it unsupported.
In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
You can parse them more or less easily because on each '' you can output a <i> or </i>.
We could make the parser output a link each time it finds one, but that would require a db query per link (to see if it's blue or red) on the page. So instead they are added to a list, checked in one big query, and replaced back at the end. Your OOM is probably coming from that list and the replacements.
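(A minimal sketch of that approach, with a made-up page_exists() batch lookup standing in for the single big query; not the actual MediaWiki code:)

    import re

    LINK = re.compile(r"\[\[([^|\[\]]+)\]\]")

    def render_links(text, page_exists):
        # First pass: swap every link for a numbered placeholder and
        # remember the titles that need to be looked up.
        titles = []
        def stash(m):
            titles.append(m.group(1))
            return "\x7fLINK-%d\x7f" % (len(titles) - 1)
        text = LINK.sub(stash, text)

        # One big batch lookup instead of one query per link.
        existing = page_exists(titles)

        # Second pass: replace the placeholders with blue or red links.
        def restore(m):
            title = titles[int(m.group(1))]
            cls = "" if title in existing else ' class="new"'
            return '<a%s href="<url>&title=%s">%s</a>' % (cls, title, title)
        return re.sub(r"\x7fLINK-(\d+)\x7f", restore, text)

    # page_exists() here is a made-up stand-in for the single db query.
    print(render_links("[[a]] and [[b]]", lambda titles: {"a"}))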
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
Lots of things should be made context-aware. '''<div>''Foo</div>''''' will happily give you <b><div><i>Foo</div></i></b>. You can close an italic *inside a link*. ''Foo [[bar|bar'' baz]] is expected to work (I changed that recently, but should re-add the 'feature').
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
/Andreas
I agree. The link text shouldn't span multiple lines. Some page designs (e.g. main pages) could be using it, though.
2010-08-12 00:13, Platonides wrote:
Andreas Jonsson wrote:
I have previously stated that the most complex thing in the MediaWiki wikitext seemed to be the apostrophes. I was wrong.
Links are divided into internal and external. Here I will write about internal links. Internal links can further be subdivided into regular links and image links; image links complicate things even further, but I will not discuss them here.
(...)
Before the pattern matching, the whole article is split into pieces using the '[[' token as delimiter. A pattern match for all variants of internal link patterns (sans the '[[') is attempted at the beginning of each piece of text. The consequence is that the closing ']]' will be matched with the last opening '[[' preceding it.
If the pattern match fails, no link is produced; the wikitext is outputted and may be processed again to produce other constructions later on.
The fundamental problem with all this is the fact that when looking at the wikitext '[[Link|', it cannot easily be determined if this is actually a link, or if it is just the text '[[Link|'. It is not a link unless there is a closing ']]' token that is not preceded by another '[[' token. The closing token may appear almost anywhere:
(funny links)
That's because, after preprocessing, tables are done before anything else. It wasn't like this originally, but after some vulnerabilities based on parsing after tables, they were moved there.
They cannot appear inside ids or classes because that would be an illegal id/class name. So it'll be moved into the parameter and then the Sanitizer strips them, I suppose.
The Sanitizer allows both "]]" and "</a>" as class names.
Don't rely on these things. Consider it unsupported.
In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
You can parse them more or less easily because on each '' you can output a <i> or </i>.
We could make the parser output a link each time it finds one, but that would require a db query per link (to see if it's blue or red) on the page. So instead they are added to a list, checked in one big query, and replaced back at the end. Your OOM is probably coming from that list and the replacements.
I don't think that this really explains how the server can run out of memory on such a small input (less than 150 kB). The splitting of the input and the regexp should run in linear time as far as I can tell, but unless there is a very large constant overhead per link, I believe that something is consuming a super-linear amount of memory. An n log n step somewhere might explain it, though.
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
Lots of things should be made context-aware. '''<div>''Foo</div>''''' will happily give you <b><div><i>Foo</div></i></b>. You can close an italic *inside a link*. ''Foo [[bar|bar'' baz]] is expected to work (I changed that recently, but should re-add the 'feature').
I'm aware of this, and it is not really a problem to handle.
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
/Andreas
I agree. The link text shouldn't span multiple lines. Some page designs (e.g. main pages) could be using it, though.
I can think of a number of variations with increasing implementation complexity:
* [[Link| opens a link, unless a link is already open; end of line closes unterminated link text.
* [[Link| opens a link, unless a link is already open; end of inline text closes unterminated link text.
* [[Link| opens a link, unless a link is already open; the start of a non-paragraph block element closes the link.
* [[Link| opens a link, unless a link is already open; list elements are disabled in the lexer, and non-paragraph block elements close the link.
* [[Link| opens a link, unless a link is already open; end of article closes unterminated link text. The closing token ']]' may only appear inside inline text.
* [[Link| opens a link, unless there is another [[Link| not preceded by a ]] on the same line. End of line closes unterminated link text.
* [[Link| opens a link, unless there is another [[Link| not preceded by a ]] in the same inline text. End of inline text closes unterminated link text.
* etc.
As long as the link opening and closing tokens are only allowed inside inlined text, it can be implemented.
However, requiring a link to be properly closed in order to be a link is fairly complex. What should the parser do with the link title if it decides that it is not really a link title after all? It may contain tokens. Thus, the lexer must use lookahead and not produce any spurious link open tokens. To avoid the n^2 worst case, a full extra pass to compute hints would be necessary before doing the actual lexing.
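(For illustration, such a hint pass could be a single linear scan that records, for each '[[', whether a ']]' follows it before the next '[['; a rough sketch that ignores legal title characters:)

    def link_open_hints(text):
        # Single linear scan: a '[[' gets a hint only if a ']]' follows it
        # before any other '[[' (rough sketch; legal title chars ignored).
        hints = set()
        i, last_open = 0, None
        while i < len(text) - 1:
            pair = text[i:i + 2]
            if pair == '[[':
                last_open = i
                i += 2
            elif pair == ']]':
                if last_open is not None:
                    hints.add(last_open)
                    last_open = None
                i += 2
            else:
                i += 1
        return hints

    print(sorted(link_open_hints("[[a]] [[b| [[c]]")))   # -> [0, 11]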
/Andreas
2010-08-12 09:30, Andreas Jonsson wrote: [...]
However, requiring a link to be properly closed in order to be a link is fairly complex. What should the parser do with the link title if it decides that it is not really a link title after all? It may contain tokens. Thus, the lexer must use lookahead and not produce any spurious link open tokens. To avoid the n^2 worst case, a full extra pass to compute hints would be necessary before doing the actual lexing.
Replying to myself. I might be wrong about the complexity of finding the closing token. The below lexer hack may actually do the trick: a rule that matches the empty string if there is a valid closing tag ahead. Since it does not search past '[[' tokens, no content will be scanned more than once by this rule. So the worst case running time is still linear.
fragment LINK_CLOSE_LOOKAHEAD
@init{ bool success = false; }:
    (
        (
            /*
             * List of all other lexer rules that may contain the strings
             * ']]' or '[['.
             */
              BEGIN_TABLE
            | TABLE_ROW_SEPARATOR
            | TABLE_CELL
            | TABLE_CELL_INLINE
            /*
             * Alternative: don't search beyond other block elements:
             */
            // ({BOL}?=> '{|')=> '{|' {false}?=>
            // | (LIST_ELEMENT)=> LIST_ELEMENT {false}?=>
            // | (NEWLINE NEWLINE)=> NEWLINE NEWLINE {false}?=>
            /*
             * Otherwise, anything goes except ']]' or '[['.
             */
            | ~('['|']')
            | {!PEEK(2, '[')}?=> '['
            | {!PEEK(2, ']')}?=> ']'
        )+
        (
              ']]' {(success = true), false}?=>
            | {false}?=>
        )
    )
    | {success}?=>
    ;
2010-08-12 07:13, Tomaž Šolc wrote:
Hi
(The actual regexp is using (.+?), but using an empty string as link text will still produce a link for some strange reason.)
Empty string is actually a special case: such links will use the name of the target page minus the namespace prefix for the anchor.
I see. But I can't find the code that handles this case. The regexp is '^([{$tc}]+)(?:\|(.+?))?]]', so if there is a '|' in the link, there must be at least one character following it before the ']]' in order to match.
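(A quick check in Python with a simplified stand-in for $tc illustrates the point:)

    import re

    tc = r"A-Za-z0-9 _:\-"           # simplified stand-in for $tc
    link = re.compile(r"^([%s]+)(?:\|(.+?))?\]\]" % tc, re.S)

    # Each candidate is the text following a '[[' after the split:
    print(bool(link.match("Foo|]] more")))     # False: nothing between '|' and ']]'
    print(bool(link.match("Foo|bar]] more")))  # True
    print(bool(link.match("Foo]] more")))      # True: no '|' at all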
/Andreas
Tomaž Šolc wrote:
Hi
(The actual regexp is using (.+?), but using an empty string as link text will still produce a link for some strange reason.)
Empty string is actually a special case: such links will use the name of the target page minus the namespace prefix for the anchor.
but that's done on save, I guess before the actual parsing.
Pipe trick syntax: [[User:Foo|]] will turn into [[User:Foo|Foo]] upon save.
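(Roughly, the save-time transformation does something like this sketch; the real pipe trick handles a few more cases, such as parenthesized disambiguators, which are ignored here:)

    import re

    def pipe_trick(wikitext):
        # '[[Namespace:Title|]]' becomes '[[Namespace:Title|Title]]' on save
        # (simplified: only the namespace prefix is stripped).
        def fill(m):
            target = m.group(1)
            label = target.split(':', 1)[-1]
            return '[[%s|%s]]' % (target, label)
        return re.sub(r"\[\[([^|\[\]]+)\|\]\]", fill, wikitext)

    print(pipe_trick("See [[User:Foo|]] for details."))
    # -> See [[User:Foo|Foo]] for details.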
Yay arcane special cases.
-- daniel
2010-08-12 09:40, Daniel Kinzler wrote:
Tomaž Šolc wrote:
Hi
(The actual regexp is using (.+?), but using an empty string as link text will still produce a link for some strange reason.)
Empty string is actually a special case: such links will use the name of the target page minus the namespace prefix for the anchor.
but that's done on save, I guess before the actual parsing.
Pipe trick syntax: [[User:Foo|]] will turn into [[User:Foo|Foo]] upon save.
OK, that explains it.
Yay arcane special cases.
Yay, automatically fixing the user input. That'll make things a lot easier. :-)
/Andreas
Andreas Jonsson wrote:
... Trying to reproduce this behavior in a new parser would, of course, be insane. In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
/Andreas
Keep in mind that MediaWiki is switching to html5. As the browsers don't even parse according to xhtml rules, and the xhtml doctype is nothing but a hint to validators (which not every page validates against properly anyways, and which aren't essential), I don't believe xhtml rules -- with the exception of valid xml output -- still apply once they are retracted by html5 (which attempts to define html parsing as it should be, based on how it already is, iirc). In this case, html5 defines <a> as "transparent content": block elements are valid inside an <a> if they are valid without the <a> there. So as long as you don't output the <p>, as you would do anyways if you got the <div> directly, then <a ...><div>...</div></a> is valid.
Just making note...
2010-08-28 14:32, Daniel Friesen wrote:
Andreas Jonsson wrote:
... Trying to reproduce this behavior in a new parser would, of course, be insane. In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
/Andreas
Keep in mind that MediaWiki is switching to html5. As the browsers don't even parse according to xhtml rules, and the xhtml doctype is nothing but a hint to validators (which not every page validates against properly anyways, and which aren't essential), I don't believe xhtml rules -- with the exception of valid xml output -- still apply once they are retracted by html5 (which attempts to define html parsing as it should be, based on how it already is, iirc). In this case, html5 defines <a> as "transparent content": block elements are valid inside an <a> if they are valid without the <a> there. So as long as you don't output the <p>, as you would do anyways if you got the <div> directly, then <a ...><div>...</div></a> is valid.
Just making note...
That's very interesting. I didn't know that.
/Andreas