2010-08-12 00:13, Platonides wrote:
Andreas Jonsson wrote:
I have previously stated that the most complex
thing in the MediaWiki
wikitext seemed to be the apostrophes. I was wrong.
Links are divided into internal and external. Here I will write about
internal links. Internal links can further be subdivided into regular
and image-links, image-links complicate things even further, but I
will not discuss them here.
(...)
Before the pattern matching, the whole article is
split into pieces
using the '[[' token as delimiter. A pattern match for all variants
of internal link patterns (sans the '[[') is attempted at the
beginning of each piece of text. The consequence is that the closing
']]' will be matched with the last opening '[[' preceding it.
If the pattern match fails, no link is produced; the wikitext is
output and may be processed again to produce other constructions
later on.
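The strategy described above can be illustrated with a minimal Python
sketch (my own simplification, not the actual PHP code; LINK_RE is a
stand-in for the real set of internal-link patterns):

```python
import re

# Hypothetical, simplified link pattern: "Target", optional "|label",
# then the closing ']]', then the rest of the piece.
LINK_RE = re.compile(r'^([^\[\]|]+)(?:\|([^\[\]]*))?\]\](.*)$', re.S)

def replace_internal_links(text):
    # Split the whole text on the '[[' token, then try to match a link
    # pattern at the beginning of each piece.
    pieces = text.split('[[')
    out = [pieces[0]]
    for piece in pieces[1:]:
        m = LINK_RE.match(piece)
        if m:
            target = m.group(1)
            label = m.group(2) or target
            rest = m.group(3)
            out.append('<a href="/wiki/%s">%s</a>%s' % (target, label, rest))
        else:
            # Match failed: no link is produced; the wikitext is
            # emitted unchanged, '[[' included.
            out.append('[[' + piece)
    return ''.join(out)
```

Because each piece runs up to the next '[[', a ']]' always pairs with
the nearest preceding '[[': '[[a [[b]]' comes out as the literal text
'[[a ' followed by a link to 'b'.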
The fundamental problem with all this is the fact that when looking at
the wikitext '[[Link|', it cannot easily be determined if this is
actually a link, or if it is just the text '[[Link|'. It is not a
link unless there is a closing ']]' token that is not preceded by
another '[[' token. The closing token may appear almost anywhere:
(funny links)
That's because, after preprocessing, tables are handled before anything
else. It wasn't always like this, but after some vulnerabilities
related to the parsing order around tables, they were moved there.
They cannot appear inside of ids or classes because that would be an
illegal id/class name. So it will end up in the parameter value, and
the Sanitizer will strip it, I suppose.
The Sanitizer allows both "]]" and "</a>" as class
names.
Don't rely on these things. Consider it
unsupported.
In fact, the current MediaWiki parser does not
seem to parse
links in linear time using linear amount of memory. My test server
failed to process a preview of an article consisting of about 24000
links of the form [[a]]. It was working hard before it, I
guess, ran out of memory. As a comparison it parsed over 38000 italic
a's, ''a'', without problems.
You can parse them more or less easily because on each '' you can
output an <i> or a </i>.
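That is, the apostrophes can be handled in a single streaming pass. A
rough Python sketch of the idea (ignoring the bold/italic
disambiguation heuristics the real parser applies):

```python
def render_italics(text):
    # Emit <i> / </i> alternately for each '' token: no lookahead or
    # buffering is needed, so time and memory stay linear.
    parts = text.split("''")
    out = []
    for i, part in enumerate(parts):
        out.append(part)
        if i < len(parts) - 1:
            out.append('<i>' if i % 2 == 0 else '</i>')
    # Close a dangling italic at end of input.
    if len(parts) % 2 == 0:
        out.append('</i>')
    return ''.join(out)
```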
We could make the parser output a link each time it finds one, but that
would require a db query per link (to see if it's blue or red) on the
page. So instead they are added to a list, checked in one big query,
and replaced back at the end.
Your OOM is probably coming from that list and the replacements.
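A rough sketch of that placeholder scheme (hypothetical names;
page_exists_batch stands in for the single big database query, and the
input is simplified to a bare list of link titles):

```python
def render_links(titles, page_exists_batch):
    # Pass 1: emit a unique placeholder per link instead of querying
    # the database for each one.
    placeholders = {}
    html_parts = []
    for i, title in enumerate(titles):
        key = '\x7fLINK-%d\x7f' % i   # unique marker, replaced later
        placeholders[key] = title
        html_parts.append(key)
    text = ' '.join(html_parts)
    # One batched existence check for all titles ("blue or red").
    existing = page_exists_batch(set(placeholders.values()))
    # Pass 2: substitute the final HTML for each placeholder.
    for key, title in placeholders.items():
        cls = '' if title in existing else ' class="new"'
        text = text.replace(key,
                            '<a%s href="/wiki/%s">%s</a>' % (cls, title, title))
    return text
```

Note that the final per-placeholder replace over the whole text is
itself superlinear when written naively like this; that is exactly the
kind of place where a large constant or an n^2 factor can hide.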
I don't think that this really explains how the server can run out of
memory on
such a small input (less than 150kB). The splitting of the input and
the regexp should run in
linear time as far as I can tell, but unless there is a very large
constant overhead
per link, I believe that something is consuming a super linear amount of
memory.
An n log n step somewhere might explain it, though.
So, what is
the reasonable thing to do? First of all it should be
pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-str…
Lots of things should be made context aware.
'''<div>''Foo</div>''''' will happily
give you
<b><div><i>Foo</div></i></b>. You can close an italic
*inside a link*.
''Foo [[bar|bar'' baz]] is expected to work (I changed that recently,
but should re-add the 'feature').
I'm aware of this, and it is not really a problem to handle.
This suggests
that any sane wikitext should not allow a link to
continue past the end of the inlined text where it is located. Even
better is to say that the sequence [[Link| always opens up a new link
and that 'end of inline text' will implicitly close the link if it is
still open. That will not require any lookahead to parse. It would
be consistent with the format parsing to only allow it to run to the
end of line, though. Also, currently paragraphs and list elements
aren't rendered inside link text, unless enclosed or preceded by a
table. So, unless tables inside link text are a widely used feature,
such a change might not break that many pages.
/Andreas
I agree. The link text shouldn't span multiple lines. Some page designs
(eg. main pages) could be using it, though.
I can think of a number of variations with increasing implementation
complexity:
* [[Link| opens a link, unless a link is already opened, end of line
closes unterminated link text.
* [[Link| opens a link, unless a link is already opened, end of inline
text closes unterminated link text.
* [[Link| opens a link, unless a link is already opened, start of a
non-paragraph block element closes the link.
* [[Link| opens a link, unless a link is already opened, list elements
are disabled in the lexer, non-paragraph block elements close the
link.
* [[Link| opens a link, unless a link is already opened, end of
article closes unterminated link text. The closing token ']]' may
only appear inside inline text.
* [[Link| opens a link, unless there is another [[Link| not preceded
by a ]] on the same line. End of line closes unterminated link
text.
* [[Link| opens a link, unless there is another [[Link| not preceded
by a ]] in the same inlined text. End of inline text closes
unterminated link text.
* etc.
As long as the link opening and closing tokens are only allowed inside
inlined text, it can be implemented.
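For example, the first variant (end of line closes an unterminated
link) can be lexed in one pass with no lookahead. A Python sketch (my
own, not MediaWiki code), handling only the piped [[Title| form:

```python
import re

# '[[Title|' opens a link unless one is already open.
OPEN_RE = re.compile(r'\[\[([^\[\]|]+)\|')

def lex_line(line):
    out, pos, open_title = [], 0, None
    while pos < len(line):
        if open_title is None:
            m = OPEN_RE.match(line, pos)
            if m:
                out.append('<a href="/wiki/%s">' % m.group(1))
                open_title = m.group(1)
                pos = m.end()
                continue
        elif line.startswith(']]', pos):
            out.append('</a>')
            open_title = None
            pos += 2
            continue
        out.append(line[pos])   # anything else is literal text
        pos += 1
    if open_title is not None:
        out.append('</a>')      # end of line closes the link
    return ''.join(out)
```

A nested '[[b|' inside an open link is emitted literally, matching the
"unless a link is already opened" rule: lex_line('[[a|x [[b|y]]')
gives '<a href="/wiki/a">x [[b|y</a>'.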
However, requiring a link to be properly closed in order to be a link
is fairly complex. What should the parser do with the link
title if it decides that it is not really a link title after all? It
may contain tokens. Thus, the lexer must use lookahead and not
produce any spurious link open tokens. To avoid the n^2 worst case, a
full extra pass to compute hints would be necessary before doing the
actual lexing.
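The hint pass could be a single backward scan per line: for each '[[',
record whether a ']]' follows before any other '[[', so the lexer can
decide in O(1) whether '[[Link| really opens a link. A sketch of that
idea (assumed, not existing code):

```python
def closing_hints(line):
    # hints[i] is True iff a '[[' starts at i and a matching ']]'
    # follows it with no intervening '[['.  One backward pass.
    hints = [False] * len(line)
    seen_close = False
    i = len(line) - 2
    while i >= 0:
        if line.startswith(']]', i):
            seen_close = True
        elif line.startswith('[[', i):
            hints[i] = seen_close
            seen_close = False   # a ']]' pairs with the nearest '[['
        i -= 1
    return hints
```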
/Andreas