MediaWiki makes a general contract that it won't allow "dangerous" HTML tags in its output. It does this by making a final pass fairly late in the process to clean HTML tag attributes, escape any tags it doesn't like, and escape unrecognised &entities;.
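For concreteness, here is a minimal sketch (Python, not MediaWiki's actual PHP Sanitizer) of what such a late cleanup pass might look like: a whitelist of recognised tags, with everything else, and any unrecognised entity, escaped. The tag and entity lists here are purely illustrative.

# A sketch only: a whitelist-based cleanup pass, not MediaWiki's real
# Sanitizer. ALLOWED_TAGS and KNOWN_ENTITIES are illustrative subsets.
import re

ALLOWED_TAGS = {"pre", "b", "i", "code"}
KNOWN_ENTITIES = {"amp", "lt", "gt", "quot", "nbsp"}

def sanitize(html: str) -> str:
    def fix_tag(m):
        # Keep recognised tags, escape the angle brackets of anything else.
        if m.group(1).lower() in ALLOWED_TAGS:
            return m.group(0)
        return m.group(0).replace("<", "&lt;").replace(">", "&gt;")

    def fix_entity(m):
        # Keep recognised entities, escape the ampersand of the rest.
        if m.group(1) in KNOWN_ENTITIES:
            return m.group(0)
        return "&amp;" + m.group(0)[1:]

    html = re.sub(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^<>]*>", fix_tag, html)
    html = re.sub(r"&([a-zA-Z][a-zA-Z0-9]*);", fix_entity, html)
    return html

print(sanitize("<pre>text with <nasty> stuff and &entities; and &amp;</pre>"))
# -> <pre>text with &lt;nasty&gt; stuff and &amp;entities; and &amp;</pre>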
Question is: should the parser attempt to do this, or assume the existence of that function?
For example, in this code:
<pre> preformatted text with <nasty><html><characters> and &entities; </pre>
Should it just treat the string as valid, passing it out literally (and letting the security code go to work), or should it keep parsing characters, stripping them, and attempting to reproduce all the work that is currently done?
Would the developers (or users, for that matter) be likely to trust a pure parser solution? It seems to me that it's a lot easier simply to scan the resulting output looking for bad bits, than it is to attempt to predict and block off all the possible routes to producing nasty code.
On the downside, if the HTML-stripping logic isn't present in the grammar, then it doesn't exist in any non-PHP implementations...
What do people think?
Steve
On Thu, Nov 22, 2007 at 02:34:27PM +1100, Steve Bennett wrote:
> Would the developers (or users, for that matter) be likely to trust a pure parser solution? It seems to me that it's a lot easier simply to scan the resulting output looking for bad bits, than it is to attempt to predict and block off all the possible routes to producing nasty code.
My opinion is that each block of code should do its thing, and no one else's thing. DJB's a whackjob, but on this point, he hews correctly to those who created this OS we pray to daily...
Cheers, -- jra
On 11/22/07, Jay R. Ashworth jra@baylink.com wrote:
> My opinion is that each block of code should do its thing, and no one else's thing. DJB's a whackjob, but on this point, he hews correctly to those who created this OS we pray to daily...
That doesn't help. Is parsing &foo; a parser "thing" or a clean/tidy/secure HTML "thing"?
Steve
On Thu, Nov 22, 2007 at 04:35:56PM +1100, Steve Bennett wrote:
> On 11/22/07, Jay R. Ashworth jra@baylink.com wrote:
> > My opinion is that each block of code should do its thing, and no one else's thing. DJB's a whackjob, but on this point, he hews correctly to those who created this OS we pray to daily...
> That doesn't help. Is parsing &foo; a parser "thing" or a clean/tidy/secure HTML "thing"?
That depends on what you're parsing it for.
If you're parsing it to decide to drop it because you think it's unsafe, I would say that a post-parser tidy pass should do it.
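To illustrate that split (a sketch only; parse_wikitext() and tidy() are hypothetical stand-ins, not MediaWiki's real functions): the parser does only structural work, and the post-parser tidy pass owns every drop/escape decision.

import html

def parse_wikitext(text: str) -> str:
    # Parser "thing": structural transformation only (trivial here).
    return "<p>" + text + "</p>"

def tidy(raw_html: str) -> str:
    # Tidy/security "thing": crudely escape everything inside the wrapper
    # the parser emitted; a real pass would whitelist instead.
    inner = raw_html[len("<p>"):-len("</p>")]
    return "<p>" + html.escape(inner, quote=False) + "</p>"

def render(text: str) -> str:
    # Security decisions live only in tidy(); the grammar stays simple.
    return tidy(parse_wikitext(text))

print(render("is &foo; safe? what about <script>?"))
# -> <p>is &amp;foo; safe? what about &lt;script&gt;?</p>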
Cheers, -- jra
Steve Bennett wrote:
> MediaWiki makes a general contract that it won't allow "dangerous" HTML tags in its output. It does this by making a final pass fairly late in the process to clean HTML tag attributes, escape any tags it doesn't like, and escape unrecognised &entities;.
> Question is: should the parser attempt to do this, or assume the existence of that function?
> For example, in this code:
> <pre> preformatted text with <nasty><html><characters> and &entities; </pre>
> Should it just treat the string as valid, passing it out literally (and letting the security code go to work), or should it keep parsing characters, stripping them, and attempting to reproduce all the work that is currently done?
> Would the developers (or users, for that matter) be likely to trust a pure parser solution? It seems to me that it's a lot easier simply to scan the resulting output looking for bad bits, than it is to attempt to predict and block off all the possible routes to producing nasty code.
> On the downside, if the HTML-stripping logic isn't present in the grammar, then it doesn't exist in any non-PHP implementations...
> What do people think?
> Steve
The grammar doesn't have HTML-stripping; everything is stripped the same way. Those HTML tags are just also valid wikitext tags. So the <pre> handler is called with the content "preformatted text with <nasty><html><characters> and &entities;" to do with it whatever fits. It can then output HTML code, output that literal text (escaped), or recursively call the parser to reparse it. Core HTML tags should be as similar as possible to extension tags. However, that limits parser guessing and some current tricks with invalid nesting, so maybe it should only be enforced on block-level tags...
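A rough sketch of that tag-hook model (the handler registry and function names are illustrative, not the real MediaWiki extension-tag API): the grammar just hands the raw inner text to the tag's handler, which can emit HTML, escape it literally, or call back into the parser.

import html

def pre_hook(inner: str, parse) -> str:
    # <pre> chooses to output its content as escaped literal text.
    return "<pre>" + html.escape(inner, quote=False) + "</pre>"

def div_hook(inner: str, parse) -> str:
    # A block-level tag that instead reparses its content recursively.
    return "<div>" + parse(inner) + "</div>"

TAG_HOOKS = {"pre": pre_hook, "div": div_hook}

def parse(wikitext: str) -> str:
    # Toy parser: recognise one wrapping <tag>...</tag>, hand the raw
    # inner text to its hook, and escape anything else.
    for name, hook in TAG_HOOKS.items():
        open_t, close_t = f"<{name}>", f"</{name}>"
        if wikitext.startswith(open_t) and wikitext.endswith(close_t):
            return hook(wikitext[len(open_t):-len(close_t)], parse)
    return html.escape(wikitext, quote=False)

print(parse("<pre> preformatted text with <nasty><html><characters> and &entities; </pre>"))
# -> <pre> preformatted text with &lt;nasty&gt;&lt;html&gt;&lt;characters&gt; and &amp;entities; </pre>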