As devotees of web standards are aware, HTML5 is no longer an XML
variant (nor is it SGML).
This occasionally leads to fun times in Visual Editor and Parsoid
land, as we try to work around various browser incompatibilities to
ensure that documents are parsed consistently. Parsoid uses an HTML5
parser, but it uses its own non-HTML5-spec serializer (i.e., not
document.body.outerHTML) in order to emit XML-compatible documents
that work around certain browser bugs (and use intelligent quoting to
reduce document size). Visual Editor tries to parse Parsoid output
using the browser's XML parser due to bugs in Internet Explorer (I
believe) and then fixes up the result to match the HTML5 parser spec
for <pre> tags. I'm not sure exactly how Visual Editor serializes its
documents to send them back to Parsoid. I bet it's not quite the same
way Parsoid serializes them.
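
To make "intelligent quoting" concrete, here is a rough sketch of the
idea in Python. This is not Parsoid's actual code: `quote_attribute` is
a made-up name, and I'm assuming the goal is simply to pick whichever
quote character needs fewer escapes while staying valid in both XML and
HTML.

```python
def quote_attribute(value: str) -> str:
    """Quote an attribute value using whichever quote needs fewer escapes.

    Hypothetical helper for illustration only; it does not handle
    whitespace normalization of tabs or newlines in attribute values.
    """
    # '&' and '<' must always be escaped inside an XML attribute value.
    escaped = value.replace("&", "&amp;").replace("<", "&lt;")
    if escaped.count('"') > escaped.count("'"):
        # Mostly double quotes inside: wrap in single quotes instead.
        return "'" + escaped.replace("'", "&#39;") + "'"
    # Default: wrap in double quotes and escape any double quotes inside.
    return '"' + escaped.replace('"', "&quot;") + '"'


json_attr = quote_attribute('{"a":"b"}')
plain_attr = quote_attribute("it's")
```

A value that is full of double quotes comes out single-quoted with no
`&quot;` noise, which is where the size savings would come from.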
In any case, I filed bugs with the W3C months ago to try to fix some
of the specs. In particular, there is no official spec algorithm for
serializing an HTML document as XML. That may now be fixed! See
https://www.w3.org/Bugs/Public/show_bug.cgi?id=13410 (start at comment
13 if you are impatient).
It would probably be worth auditing VE's and Parsoid's serialization
algorithms to ensure that they are compatible with the new draft
standard
(http://www.w3.org/TR/DOM-Parsing/#dfn-concept-xml-serialization-algorithm),
so that we can suggest improvements if we've got interesting corner
cases and weird hacks that turn out to be needed for interoperability
in the real world.
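
One corner case such an audit would need to cover is the <pre> fix-up
mentioned earlier: an HTML5 parser ignores a single newline immediately
after a <pre> start tag, whereas an XML parser keeps it. Here is a
minimal sketch of that fix-up, using Python's `xml.dom.minidom` as a
stand-in for the browser's XML parser. The function name is
hypothetical and this is not VE's actual code.

```python
from xml.dom import minidom


def drop_leading_pre_newline(doc: minidom.Document) -> None:
    """Mimic the HTML5 parser, which ignores one newline right after <pre>."""
    for pre in doc.getElementsByTagName("pre"):
        first = pre.firstChild
        # Only a leading text node that starts with "\n" is affected.
        if first is not None and first.nodeType == first.TEXT_NODE:
            if first.data.startswith("\n"):
                first.data = first.data[1:]


# An XML parser keeps the newline after <pre>; the fix-up removes it.
doc = minidom.parseString("<body><pre>\nline one\n</pre></body>")
drop_leading_pre_newline(doc)
pre_text = doc.getElementsByTagName("pre")[0].firstChild.data
```

Only the first newline is dropped; the trailing one, and any newline
that is not the very first character, is left alone.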
(And see also
https://www.w3.org/Bugs/Public/show_bug.cgi?id=25225 --
it turns out that not even the HTML serializer API is completely
defined in the spec, although `outerHTML` provides a means to get at
the HTML fragment serializer. We had some issues with disappearing
whitespace in the outer contexts of HTML documents as a result.)
--scott
--
(http://cscott.net)