Steve Bennett wrote:
MediaWiki makes a general contract that it won't allow "dangerous" HTML tags in its output. It does this by running a final pass fairly late in the process to clean HTML tag attributes and to escape any tags it doesn't like, as well as unrecognised &entities;.
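(Roughly, that final pass amounts to something like the following sketch. The whitelist, entity list, and function names here are invented for illustration; this is not MediaWiki's actual Sanitizer code.)

import re

# Toy whitelist; the real sanitizer allows many more tags and also
# validates attributes per tag, which is omitted here.
ALLOWED_TAGS = {"pre", "p", "b", "i", "br"}
KNOWN_ENTITIES = {"amp", "lt", "gt", "quot", "nbsp"}

def escape_bad_tags(text):
    # Escape any tag whose name isn't whitelisted so it shows up literally.
    def repl(m):
        if m.group(1).lower() in ALLOWED_TAGS:
            return m.group(0)
        return m.group(0).replace("<", "&lt;").replace(">", "&gt;")
    return re.sub(r"</?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>", repl, text)

def escape_bad_entities(text):
    # Escape the ampersand of any named entity we don't recognise.
    def repl(m):
        if m.group(1).lower() in KNOWN_ENTITIES:
            return m.group(0)
        return "&amp;" + m.group(0)[1:]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

def sanitize_output(text):
    # Run over the parser's finished output: tags first, then entities.
    return escape_bad_entities(escape_bad_tags(text))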
The question is: should the parser attempt to do this itself, or assume the existence of that function?
For example, in this code:
<pre> preformatted text with <nasty><html><characters> and &entities; </pre>
Should the parser just treat the string as valid and pass it out literally (letting the security code go to work), or should it parse those characters itself, strip them, and attempt to reproduce all the work that is currently done?
Would the developers (or users, for that matter) be likely to trust a pure parser solution? It seems to me that it's a lot easier simply to scan the resulting output looking for bad bits than it is to attempt to predict and block off all the possible routes to producing nasty code.
On the downside, if the HTML-stripping logic isn't present in the grammar, then it doesn't exist in any non-PHP implementations...
What do people think?
Steve
The grammar doesn't have HTML-stripping; everything is stripped. Those HTML tags are just also valid wikitext tags. So the <pre> handler is called with the content "preformatted text with <nasty><html><characters> and &entities;" to do with as it sees fit. It can then output HTML code, output that literal text (escaped), or recursively call the parser to reparse it. Core HTML tags should be as similar as possible to extension tags. However, that limits parser guessing and some current tricks with invalid nesting, so maybe it should only be enforced on block-level tags...
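For concreteness, here is a rough sketch of that handler contract. The function names and the dispatch table are invented for illustration; this is not the actual MediaWiki or extension tag API:

import html

def parse_wikitext(text):
    # Invented stand-in for a recursive call back into the wikitext parser.
    return html.escape(text)

def pre_handler(content, attributes):
    # The grammar hands over the raw inner text, pseudo-tags and entities
    # included; this handler chooses to emit it literally, escaped.
    return "<pre>" + html.escape(content) + "</pre>"

def ref_handler(content, attributes):
    # An extension-style tag might instead reparse its content as wikitext.
    return '<span class="reference">' + parse_wikitext(content) + "</span>"

# Core HTML tags and extension tags share one handler signature, so <pre>
# is just another entry in the dispatch table.
TAG_HANDLERS = {"pre": pre_handler, "ref": ref_handler}

print(TAG_HANDLERS["pre"]("preformatted text with <nasty><html><characters> and &entities;", {}))

With that shape, whether the escaping happens inside each handler or in a final output pass is a choice the handlers can make, not something the grammar itself has to encode.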