Simetrical wrote:
> On Thu, Jul 31, 2008 at 3:48 PM, Brion Vibber <brion@wikimedia.org> wrote:
>> HTML 4 defines the contents of those elements as CDATA in the DTD, just like <br> and <img> are defined as having no content, so there's no ambiguity when they're being interpreted by an HTML parser.
>> XHTML doesn't provide for that sort of declaration, since XML requires you to be able to parse a document without having a DTD ahead of time.
>> For compatibility of documents between both HTML and XHTML parsers, XHTML 1.0 recommends using linked resources if possible -- so there's no worry about how to escape contents -- or else using explicit <![CDATA[...]]> sections in your <script> and <style> elements.
>
> So in fact, a compliant HTML parser *would* parse the contents of <script> or <style> incorrectly, if it contained entities that were expected to be decoded?
Right... I just did a quick test to confirm. This file:
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
  <head>
  <title>Test</title>
  </head>
  <body>
  <script>
  alert("&amp;");
  </script>
  </body>
  </html>
will display "&amp;" when served as text/html and "&" when served as application/xhtml+xml.
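The same split can be reproduced outside a browser. The following is only a rough sketch using Python's standard library as a stand-in for the two parsing modes; the helper names (ScriptText, script_as_html, script_as_xml) and the trimmed-down DOC string are made up for illustration, and a real browser's parser is of course more involved.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Trimmed-down version of the test file (no DOCTYPE or namespace, for brevity).
DOC = '<html><body><script>alert("&amp;");</script></body></html>'


class ScriptText(HTMLParser):
    """Collects the text that an HTML parser sees inside <script>."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)


def script_as_html(doc):
    """HTML mode: <script> content is CDATA, so entities are left alone."""
    parser = ScriptText()
    parser.feed(doc)
    parser.close()
    return "".join(parser.chunks)


def script_as_xml(doc):
    """XML mode: <script> content is ordinary text, so entities are decoded."""
    return ET.fromstring(doc).find("body/script").text


print(script_as_html(DOC))  # alert("&amp;");
print(script_as_xml(DOC))   # alert("&");
```

The HTML-style parser hands the script its text untouched, while the XML parser decodes the entity first, which matches what the two MIME types produced in the test above.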
> In that case my fix is wrong, and we should write up a Sanitizer::escapeCdata() and use that here (and elsewhere).
Icky... but perhaps no good way around it I guess. :)
The tricky bit is that for HTML mode you want to wrap the "<![CDATA[" and "]]>" bits in comments (/* blah */) so they don't interfere with the JS or CSS code.
Does that necessarily generally work? Bah, this shouldn't be so annoyingly hard...
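For what it's worth, here is a minimal sketch of the comment-wrapping idea, written in Python purely for illustration. The escape_cdata name is hypothetical and this is not the proposed Sanitizer::escapeCdata(); it assumes the payload is JS or CSS (both accept /* */ comments) and simply refuses input containing "]]>", since the usual XML trick of splitting the CDATA section would leave stray markers in the code when parsed as HTML.

```python
import xml.etree.ElementTree as ET


def escape_cdata(code):
    """Wrap JS or CSS so it survives both HTML and XML parsing.

    Hypothetical helper: the CDATA markers are hidden inside /* */
    comments so an HTML parser, which passes them through as raw text,
    does not break the script or stylesheet.
    """
    if "]]>" in code:
        # Would end the CDATA section early in XML mode; not handled here.
        raise ValueError('content must not contain "]]>"')
    return "/*<![CDATA[*/" + code + "/*]]>*/"


wrapped = escape_cdata('alert("&");')
print(wrapped)  # /*<![CDATA[*/alert("&");/*]]>*/

# What an XML parser hands to the script engine: the markers vanish and
# only two empty comments remain around the code.
xml_text = ET.fromstring("<script>" + wrapped + "</script>").text
print(xml_text)  # /**/alert("&");/**/
```

In HTML mode the whole wrapped string reaches the script engine as-is, with the markers sitting inside comments; in XML mode the parser strips the markers and the ampersand needs no entity. A literal "</script" in the payload would still end the element in HTML mode, which this sketch does not try to solve.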
-- brion