I have previously stated that the most complex thing in the MediaWiki wikitext seemed to be the apostrophes. I was wrong.
Links are divided into internal and external. Here I will write about internal links. Internal links can further be subdivided into regular links and image links; image links complicate things even further, but I will not discuss them here.
In its most basic form a link has the pattern:
prefix '[[' (' '|'\t')* title=(LEGAL_TITLE_CHARS+) (' '|'\t')* ']]' trail
Whether the prefix is part of the link or not is configurable. The rendered html is
<a href="<url>&title=$title"> $prefix$title$trail </a>
The only slight problem here for implementing a parser is that LEGAL_TITLE_CHARS and the prefix support are configurable.
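To illustrate, a rough sketch in Python (the character class here is just a stand-in for the configurable LEGAL_TITLE_CHARS, and prefix/trail handling is left out):

    import re

    # Simplified stand-in for the configurable LEGAL_TITLE_CHARS class.
    LEGAL_TITLE = r"[A-Za-z0-9 _\-/():]"

    # '[[' (' '|'\t')* title (' '|'\t')* ']]'
    BASIC_LINK = re.compile(r"\[\[[ \t]*(" + LEGAL_TITLE + r"+?)[ \t]*\]\]")

    def render_basic(text):
        # Replace every basic internal link with an anchor (prefix/trail omitted).
        return BASIC_LINK.sub(
            lambda m: '<a href="<url>&title=%s">%s</a>' % (m.group(1), m.group(1)),
            text)

    print(render_basic("See [[Main Page]] for details."))
    # -> See <a href="<url>&title=Main Page">Main Page</a> for details.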
The situation becomes funny when considering the case with custom link text:
'[[' <link pattern> '|' (.*?) ']]'
(I have substituted the pattern (' '|'\t')* title=(LEGAL_TITLE_CHARS+) (' '|'\t')* with <link pattern> and removed the prefix and trail for clarity.) The .* is matched non-greedily, so it will be the first instance of ']]' following the opening '[[' that will be matched. (The actual regexp is using (.+?), but using an empty string as link text will still produce a link for some strange reason.)
Before the pattern matching, the whole article is split into pieces using the '[[' token as delimiter. A pattern match for all variants of internal link patterns (sans the '[[') is attempted at the beginning of each piece of text. The consequence is that the closing ']]' will be matched with the last opening '[[' preceding it.
If the pattern match fails, no link is produced; the wikitext is outputted and may be processed again to produce other constructions later on.
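A rough Python sketch of this split-and-match strategy (simplified: a stand-in title character class and only the basic and piped variants) shows why a ']]' always pairs with the last '[[' preceding it:

    import re

    # Simplified title character class; the real regexp uses the
    # configurable $tc class.
    LINK_AT_START = re.compile(r"^([A-Za-z0-9 _\-/():]+)(?:\|(.+?))?\]\]", re.S)

    def replace_links(text):
        pieces = text.split('[[')
        out = [pieces[0]]
        for piece in pieces[1:]:
            m = LINK_AT_START.match(piece)
            if m:
                title = m.group(1)
                label = m.group(2) if m.group(2) is not None else title
                out.append('<a href="<url>&title=%s">%s</a>%s'
                           % (title, label, piece[m.end():]))
            else:
                # No match: emit the '[[' literally and leave the rest as text.
                out.append('[[' + piece)
        return ''.join(out)

    print(replace_links("a [[x [[Link|text]] b"))
    # -> a [[x <a href="<url>&title=Link">text</a> b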
The fundamental problem with all this is the fact that when looking at the wikitext '[[Link|', it cannot easily be determined if this is actually a link, or if it is just the text '[[Link|'. It is not a link unless there is a closing ']]' token that is not preceded by another '[[' token. The closing token may appear almost anywhere:
This is not a link:
---------
[[Link|
text.
---------
But this is a link:
---------
[[Link|
text.]]
---------
And this:
---------
[[Link|
text
* list item]]
---------
The list element will not be rendered as a list. A table will, however, be rendered as a table:
--------
[[Link|
{|
| col1
|}
text]]
--------
Do you need to put a table inside a definition term? No problem:
--------
;[[l|
{|
| col1
|}]]
--------
This is an especially funny example:
---------
[[Link|
{|
| col1
|- id="row2" class=']]'
|}
--------
The rendered link will be left unterminated.
Links cannot be opened inside table parameters, though. This is not a link:
--------
{| id="[[Link|"
| text]]
|}
--------
Even though the table parameters usually swallow any junk:
--------
{| id="table1" class="[[Link]]"
| No link above.
|}
--------
Column parameters are different:
--------
{|
| class="[[Link]]" | <-- will not actually become column parameters
| class="[[Link" | <-- still not column parameters, but this time no link either
| class="[[Link" | but of course, if there happens to be a ']]' token somewhere later in the document, it's a whole different matter.
|}
--------
Trying to reproduce this behavior in a new parser would, of course, be insane. In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
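To illustrate, a minimal sketch (Python, piped links only, not MediaWiki's actual behaviour) of the simplest variant, where [[Title| always opens a link unless one is already open, and end of line implicitly closes it:

    import re

    OPEN = re.compile(r"\[\[([^|\[\]\n]+)\|")

    def render_line(line):
        out, pos, open_link = [], 0, False
        while pos < len(line):
            if not open_link:
                m = OPEN.search(line, pos)
                if not m:
                    out.append(line[pos:])
                    break
                out.append(line[pos:m.start()])
                out.append('<a href="<url>&title=%s">' % m.group(1))
                open_link = True
                pos = m.end()
            else:
                close = line.find(']]', pos)
                if close == -1:
                    out.append(line[pos:])
                    pos = len(line)
                else:
                    out.append(line[pos:close] + '</a>')
                    open_link = False
                    pos = close + 2
        if open_link:
            out.append('</a>')    # end of line implicitly closes the link
        return ''.join(out)

    print(render_line("a [[Foo|bar]] and an unterminated [[Baz|qux"))
    # -> a <a href="<url>&title=Foo">bar</a> and an unterminated
    #    <a href="<url>&title=Baz">qux</a>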
/Andreas
Andreas Jonsson wrote:
I have previously stated that the most complex thing in the MediaWiki wikitext seemed to be the apostrophes. I was wrong.
Links are divided into internal and external. Here I will write about internal links. Internal links can further be subdivided into regular links and image links; image links complicate things even further, but I will not discuss them here.
(...)
Before the pattern matching, the whole article is split into pieces using the '[[' token as delimiter. A pattern match for all variants of internal link patterns (sans the '[[') is attempted at the beginning of each piece of text. The consequence is that the closing ']]' will be matched with the last opening '[[' preceding it.
If the pattern match fails, no link is produced; the wikitext is outputted and may be processed again to produce other constructions later on.
The fundamental problem with all this is the fact that when looking at the wikitext '[[Link|', it cannot easily be determined if this is actually a link, or if it is just the text '[[Link|'. It is not a link unless there is a closing ']]' token that is not preceded by another '[[' token. The closing token may appear almost anywhere:
(funny links)
That's because, after preprocessing, tables are done before anything else. It wasn't like this originally, but after some vulnerabilities based on parsing after tables, they were moved there.
They cannot appear inside ids or classes because that would be an illegal id/class name. So it'll be moved into the parameter and then the Sanitizer strips them, I suppose.
Don't rely on these things. Consider it unsupported.
In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
You can parse them more or less easily because on each '' you can output a <i> or </i>.
We could make the parser output a link each time it finds one, but that would require a db query per link (to see if it's blue or red) on the page. So instead they are added to a list, checked in one big query, and replaced back at the end. Your OOM is probably coming from that list and the replacements.
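(A minimal sketch of that approach, with a made-up page_exists() batch lookup standing in for the single big query; not the actual MediaWiki code:)

    import re

    LINK = re.compile(r"\[\[([^|\[\]]+)\]\]")

    def render_links(text, page_exists):
        # First pass: swap every link for a numbered placeholder and
        # remember the titles that need to be looked up.
        titles = []
        def stash(m):
            titles.append(m.group(1))
            return "\x7fLINK-%d\x7f" % (len(titles) - 1)
        text = LINK.sub(stash, text)

        # One big batch lookup instead of one query per link.
        existing = page_exists(titles)

        # Second pass: replace the placeholders with blue or red links.
        def restore(m):
            title = titles[int(m.group(1))]
            cls = "" if title in existing else ' class="new"'
            return '<a%s href="<url>&title=%s">%s</a>' % (cls, title, title)
        return re.sub(r"\x7fLINK-(\d+)\x7f", restore, text)

    # page_exists() here is a made-up stand-in for the single db query.
    print(render_links("[[a]] and [[b]]", lambda titles: {"a"}))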
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
Lots of things should be made context-aware. '''<div>''Foo</div>''''' will happily give you <b><div><i>Foo</div></i></b>. You can close an italic *inside a link*. ''Foo [[bar|bar'' baz]] is expected to work (I changed that recently, but should re-add the 'feature').
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
/Andreas
I agree. The link text shouldn't span multiple lines. Some page designs (e.g. main pages) could be using it, though.
2010-08-12 00:13, Platonides wrote:
Andreas Jonsson wrote:
I have previously stated that the most complex thing in the MediaWiki wikitext seemed to be the apostrophes. I was wrong.
Links are divided into internal and external. Here I will write about internal links. Internal links can further be subdivided into regular links and image links; image links complicate things even further, but I will not discuss them here.
(...)
Before the pattern matching, the whole article is split into pieces using the '[[' token as delimiter. A pattern match for all variants of internal link patterns (sans the '[[') is attempted at the beginning of each piece of text. The consequence is that the closing ']]' will be matched with the last opening '[[' preceding it.
If the pattern match fails, no link is produced; the wikitext is outputted and may be processed again to produce other constructions later on.
The fundamental problem with all this is the fact that when looking at the wikitext '[[Link|', it cannot easily be determined if this is actually a link, or if it is just the text '[[Link|'. It is not a link unless there is a closing ']]' token that is not preceded by another '[[' token. The closing token may appear almost anywhere:
(funny links)
That's because, after preprocessing, tables are done before anything else. It wasn't like this originally, but after some vulnerabilities based on parsing after tables, they were moved there.
They cannot appear inside ids or classes because that would be an illegal id/class name. So it'll be moved into the parameter and then the Sanitizer strips them, I suppose.
The Sanitizer allows both "]]" and "</a>" as class names.
Don't rely on these things. Consider it unsupported.
In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
You can parse them more or less easily because on each '' you can output a <i> or </i>.
We could make the parser output a link each time it finds one, but that would require a db query per link (to see if it's blue or red) on the page. So instead they are added to a list, checked in one big query, and replaced back at the end. Your OOM is probably coming from that list and the replacements.
I don't think that this really explains how the server can run out of memory on such a small input (less than 150 kB). The splitting of the input and the regexp should run in linear time as far as I can tell, but unless there is a very large constant overhead per link, I believe that something is consuming a super-linear amount of memory. An n log n step somewhere might explain it, though.
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
Lots of things should be made context-aware. '''<div>''Foo</div>''''' will happily give you <b><div><i>Foo</div></i></b>. You can close an italic *inside a link*. ''Foo [[bar|bar'' baz]] is expected to work (I changed that recently, but should re-add the 'feature').
I'm aware of this, and it is not really a problem to handle.
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
/Andreas
I agree. The link text shouldn't span multiple lines. Some page designs (e.g. main pages) could be using it, though.
I can think of a number of variations with increasing implementation complexity:
* [[Link| opens a link, unless a link is already open; end of line closes unterminated link text.
* [[Link| opens a link, unless a link is already open; end of inline text closes unterminated link text.
* [[Link| opens a link, unless a link is already open; the start of a non-paragraph block element closes the link.
* [[Link| opens a link, unless a link is already open; list elements are disabled in the lexer, and non-paragraph block elements close the link.
* [[Link| opens a link, unless a link is already open; end of article closes unterminated link text. The closing token ']]' may only appear inside inline text.
* [[Link| opens a link, unless there is another [[Link| not preceded by a ]] on the same line. End of line closes unterminated link text.
* [[Link| opens a link, unless there is another [[Link| not preceded by a ]] in the same inline text. End of inline text closes unterminated link text.
* etc.
As long as the link opening and closing tokens are only allowed inside inlined text, it can be implemented.
However, requiring a link to be properly closed in order to be a link is fairly complex. What should the parser do with the link title if it decides that it is not really a link title after all? It may contain tokens. Thus, the lexer must use lookahead and not produce any spurious link open tokens. To avoid the n^2 worst case, a full extra pass to compute hints would be necessary before doing the actual lexing.
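(For illustration, such a hint pass could be a single linear scan that records, for each '[[', whether a ']]' follows it before the next '[['; a rough sketch that ignores legal title characters:)

    def link_open_hints(text):
        # Single linear scan: a '[[' gets a hint only if a ']]' follows it
        # before any other '[[' (rough sketch; legal title chars ignored).
        hints = set()
        i, last_open = 0, None
        while i < len(text) - 1:
            pair = text[i:i + 2]
            if pair == '[[':
                last_open = i
                i += 2
            elif pair == ']]':
                if last_open is not None:
                    hints.add(last_open)
                    last_open = None
                i += 2
            else:
                i += 1
        return hints

    print(sorted(link_open_hints("[[a]] [[b| [[c]]")))   # -> [0, 11]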
/Andreas
2010-08-12 09:30, Andreas Jonsson wrote: [...]
However, requiring a link to be properly closed in order to be a link is fairly complex. What should the parser do with the link title if it decides that it is not really a link title after all? It may contain tokens. Thus, the lexer must use lookahead and not produce any spurious link open tokens. To avoid the n^2 worst case, a full extra pass to compute hints would be necessary before doing the actual lexing.
Replying to myself. I might be wrong about the complexity of finding the closing token. The below lexer hack may actually do the trick: a rule that matches the empty string if there is a valid closing tag ahead. Since it does not search past '[[' tokens, no content will be scanned more than once by this rule. So the worst case running time is still linear.
fragment LINK_CLOSE_LOOKAHEAD
@init{ bool success = false; }:
    (
        (
            /*
             * List of all other lexer rules that may contain the strings
             * ']]' or '[['.
             */
              BEGIN_TABLE
            | TABLE_ROW_SEPARATOR
            | TABLE_CELL
            | TABLE_CELL_INLINE
            /*
             * Alternative: don't search beyond other block elements:
             */
            // ({BOL}?=> '{|')=> '{|' {false}?=>
            // | (LIST_ELEMENT)=> LIST_ELEMENT {false}?=>
            // | (NEWLINE NEWLINE)=> NEWLINE NEWLINE {false}?=>
            /*
             * Otherwise, anything goes except ']]' or '[['.
             */
            | ~('['|']')
            | {!PEEK(2, '[')}?=> '['
            | {!PEEK(2, ']')}?=> ']'
        )+
        (
              ']]' {(success = true), false}?=>
            | {false}?=>
        )
    )
    | {success}?=>
    ;
2010-08-12 07:13, Tomaž Šolc wrote:
Hi
(The actual regexp is using (.+?), but using an empty string as link text will still produce a link for some strange reason.)
Empty string is actually a special case: such links will use the name of the target page minus the namespace prefix for the anchor.
I see. But I can't find the code that handles this case. The regexp is '^([{$tc}]+)(?:\|(.+?))?]]', so if there is a '|' in the link, there must be at least one character following it before the ']]' in order to match.
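(A quick check in Python with a simplified stand-in for $tc illustrates the point:)

    import re

    tc = r"A-Za-z0-9 _:\-"           # simplified stand-in for $tc
    link = re.compile(r"^([%s]+)(?:\|(.+?))?\]\]" % tc, re.S)

    # Each candidate is the text following a '[[' after the split:
    print(bool(link.match("Foo|]] more")))     # False: nothing between '|' and ']]'
    print(bool(link.match("Foo|bar]] more")))  # True
    print(bool(link.match("Foo]] more")))      # True: no '|' at all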
/Andreas
Tomaž Šolc wrote:
Hi
(The actual regexp is using (.+?), but using an empty string as link text will still produce a link for some strange reason.)
Empty string is actually a special case: such links will use the name of the target page minus the namespace prefix for the anchor.
but that's done on save, I guess before the actual parsing.
Pipe trick syntax: [[User:Foo|]] will turn into [[User:Foo|Foo]] upon save.
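(Roughly, the save-time transformation does something like this sketch; the real pipe trick handles a few more cases, such as parenthesized disambiguators, which are ignored here:)

    import re

    def pipe_trick(wikitext):
        # '[[Namespace:Title|]]' becomes '[[Namespace:Title|Title]]' on save
        # (simplified: only the namespace prefix is stripped).
        def fill(m):
            target = m.group(1)
            label = target.split(':', 1)[-1]
            return '[[%s|%s]]' % (target, label)
        return re.sub(r"\[\[([^|\[\]]+)\|\]\]", fill, wikitext)

    print(pipe_trick("See [[User:Foo|]] for details."))
    # -> See [[User:Foo|Foo]] for details.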
Yay arcane special cases.
-- daniel
2010-08-12 09:40, Daniel Kinzler wrote:
Tomaž Šolc wrote:
Hi
(The actual regexp is using (.+?), but using an empty string as link text will still produce a link for some strange reason.)
Empty string is actually a special case: such links will use the name of the target page minus the namespace prefix for the anchor.
but that's done on save, I guess before the actual parsing.
Pipe trick syntax: [[User:Foo|]] will turn into [[User:Foo|Foo]] upon save.
OK, that explains it.
Yay arcane special cases.
Yay, automatically fixing the user input. That'll make things a lot easier. :-)
/Andreas
Andreas Jonsson wrote:
... Trying to reproduce this behavior in a new parser would, of course, be insane. In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
/Andreas
Keep in mind that MediaWiki is switching to html5. As the browsers don't even parse according to xhtml rules, and the xhtml doctype is nothing but a hint to validators (which not every page validates against properly anyways, and which aren't essential), I don't believe xhtml rules -- with the exception of valid xml output -- still apply once they are retracted by html5 (which attempts to define html parsing as it should be, based on how it already is, iirc). In this case, html5 defines <a> as "transparent content": block elements are valid inside an <a> if they are valid without the <a> there. So as long as you don't output the <p>, as you would do anyways if you got the <div> directly, then <a ...><div>...</div></a> is valid.
Just making note...
2010-08-28 14:32, Daniel Friesen wrote:
Andreas Jonsson wrote:
... Trying to reproduce this behavior in a new parser would, of course, be insane. In fact, the current MediaWiki parser does not seem to parse links in linear time using a linear amount of memory. My test server failed to process a preview of an article consisting of about 24000 links of the form [[a]]. It was working hard before it, I guess, ran out of memory. As a comparison, it parsed over 38000 italic a's, ''a'', without problems.
So, what is the reasonable thing to do? First of all it should be pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-stri...
This suggests that any sane wikitext should not allow a link to continue past the end of the inline text where it is located. Even better is to say that the sequence [[Link| always opens a new link and that 'end of inline text' will implicitly close the link if it is still open. That will not require any lookahead to parse. It would be consistent with the format parsing to only allow it to run to the end of the line, though. Also, currently paragraphs and list elements aren't rendered inside link text, unless enclosed or preceded by a table. So, unless tables inside link text are a widely used feature, such a change might not break that many pages.
/Andreas
Keep in mind that MediaWiki is switching to html5. As the browsers don't even parse according to xhtml rules, and the xhtml doctype is nothing but a hint to validators (which not every page validates against properly anyways, and which aren't essential), I don't believe xhtml rules -- with the exception of valid xml output -- still apply once they are retracted by html5 (which attempts to define html parsing as it should be, based on how it already is, iirc). In this case, html5 defines <a> as "transparent content": block elements are valid inside an <a> if they are valid without the <a> there. So as long as you don't output the <p>, as you would do anyways if you got the <div> directly, then <a ...><div>...</div></a> is valid.
Just making note...
That's very interesting. I didn't know that.
/Andreas