[Wikitext-l] Wikitext Madness of the Day: Internal Links

Wed Aug 11 22:13:07 UTC 2010

Andreas Jonsson wrote:
> 
> I have previously stated that the most complex thing in the MediaWiki
> wikitext seemed to be the apostrophes.  I was wrong.
> 
> Links are divided into internal and external.  Here I will write about
> internal links.  Internal links can further be subdivided into regular
> and image-links, image-links complicate things even further, but I
> will not discuss them here.
> 
(...)
> 
> Before the pattern matching, the whole article is split into pieces
> using the '[[' token as delimiter.  A pattern match for all variants
> of internal link patterns (sans the '[[') is attempted at the
> beginning of each piece of text.  The consequence is that the closing
> ']]' will be mathed with the last opening '[[' preceeding it.
> 
> If the pattern match fails, no link is produced; the wikitext is
> outputted and may be processed again to produce other constructions
> later on.
> 
> The fundamental problem with all this is the fact that when looking at
> the wikitext '[[Link|', it cannot easily be determined if this is
> actually a link, or if it is just the text '[[Link|'.  It is not a
> link unless there is a closing ']]' token, that is not preceeded by
> another '[[' token.  The closing token may appear almost anywhere:

(funny links)

That's because after preprocessing, tables are doing before anything
else. This wasn't like this originally, but after some vulnerabilities
based on parsing after tables, they were moved there.

They cannot appear inside of ids or classes because that would be an
illegal id/class name. So it'll be moved into the parameter, then the
Sanitizer stripping them, I suppose.

Don't rely on these things. Consider it unsupported.

> In fact, the current MediaWiki parser does not seem to parse
> links in linear time using linear amount of memory.  My test server
> failed to process a preview of an article consisisting of about 24000
> links on the form [[a]]. It was working hard before it, I
> guess, ran out of memory.  As a comparison it parsed over 38000 italic
> a's, ''a'', without problems.

You can parse them more or less easily because on each '' you can output
a <i> or </i>.

We could make the parser output a link each time it finds one, but that
would require a db query per link (to see if it's blue or red) on the
page. So instead they are added to a list checkid in one big query and
replaced back at the end.
Your OOM will be noticing that listing and the replacements.

> So, what is the reasonable thing to do?  First of all it should be
> pointed out that block elements are not allowed inside link text:
> 
> http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-strict.dtd_a

Lots of things should be made context aware.
'''<div>''Foo</div>''''' will happily give you
<b><div><i>Foo</div></i></b>. You can close an italic *inside a link*.
''Foo [[bar|bar'' baz]] is expected to work (I changed that recently,
but should readd the 'feature').

> This suggests that any sane wikitext should not allow a link to
> continue past the end of the inlined text where it is located.  Even
> better is to say that the sequence [[Link| always opens up a new link
> and that 'end of inline text' will implicitly close the link if it is
> still open.  That will not require any lookahead to parse.  It would
> be consistent with the format parsing to only allow it to run to the
> end of line, though.  Also, currently paragraphs and list elements
> aren't rendered inside link text, unless enclosed or preceeded by a
> table.  So, unless tables inside link text is a widely used feature,
> such a change might not break that many pages.
> 
> /Andreas

I agree. The link text shouldn't span multiple lines. Some page designs
(eg. main pages) could be using it, though.