Even an
annotated HTML DOM (using the data-* attributes for example)
could be used. We might actually be able to off-load most
context-sensitive parts of the parsing process to the browser's HTML
parser by feeding it pre-tokenized HTML tag soup, for example via
.innerHTML.
I'm not sure what you are proposing -- are you suggesting that we let
some anomalies persist and let the browser take care of them?
Yes and no ;) I was speculating on the possibility of using the built-in
HTML5 parser of modern browsers to implement part of our in-browser
parsing pipeline, especially for the visual editor. When we feed tag
soup produced by a CFG-based tokenizer to a modern browser with an HTML5
parser (e.g., FF4+) via .innerHTML, it will sanitize the input according
to the HTML5 parsing spec. If we then read .innerHTML back, we get a
sanitized serialization (see the example at the end). Of course, we
could also skip the serialization and just use the cleaned-up DOM
fragment directly, walking it and turning it into WikiDom.
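To illustrate that last step (walking the sanitized fragment to produce WikiDom), here is a sketch in plain JavaScript over a minimal stand-in node model; the toWikiDom name and the { text, annotations } output shape are illustrative assumptions, not the actual WikiDom format:

```javascript
// Minimal node model standing in for the browser's DOM fragment:
// element nodes are { name, children }, text nodes are plain strings.
// Walk the tree depth-first, collecting the flattened text plus one
// annotation range per element (a hypothetical WikiDom-like shape).
function toWikiDom(root) {
    var text = '';
    var annotations = [];
    function walk(node) {
        if (typeof node === 'string') {
            text += node;
            return;
        }
        var start = text.length;
        node.children.forEach(walk);
        annotations.push({ type: node.name, start: start, end: text.length });
    }
    root.children.forEach(walk);
    return { text: text, annotations: annotations };
}

// The cleaned-up fragment for <b>bb<i>bbii</i></b><i>ii</i>:
var fragment = {
    name: 'body',
    children: [
        { name: 'b', children: ['bb', { name: 'i', children: ['bbii'] }] },
        { name: 'i', children: ['ii'] }
    ]
};
// toWikiDom(fragment) yields text "bbbbiiii" with a bold range [0,6)
// and two (now non-overlapping) italic ranges, [2,6) and [6,8).
```

A real implementation would of course walk live DOM nodes (childNodes, nodeType) instead of this toy structure, but the traversal logic is the same.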
This is just an idea at this stage, and there might be more issues that
sink it. In particular, preserving overlapping annotations might be
tricky: HTML5 parsers break overlapping ranges up into non-overlapping
ones, so these would need to be merged back together when building the
WikiDom. Alternatively, there are JavaScript libraries implementing the
HTML5 parsing spec which could be modified if the plain HTML5 behavior
is not ideal.
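To make the merging step concrete: the parser turns <b>bb<i>bbii</b>ii</i> into <b>bb<i>bbii</i></b><i>ii</i> (see the example at the end), splitting the overlapping italic into two adjacent ranges. A minimal sketch of merging them back, in plain JavaScript; mergeAdjacent and the range representation are hypothetical, not existing code:

```javascript
// Annotations as character ranges over the flattened text "bbbbiiii":
// the HTML5 parser split the italic [2,8) into [2,6) and [6,8).
// Merge ranges of the same type that touch or overlap.
function mergeAdjacent(annotations) {
    var byType = {};
    annotations.forEach(function (a) {
        (byType[a.type] = byType[a.type] || []).push(a);
    });
    var merged = [];
    Object.keys(byType).forEach(function (type) {
        var ranges = byType[type].sort(function (x, y) {
            return x.start - y.start;
        });
        var cur = { type: type, start: ranges[0].start, end: ranges[0].end };
        for (var i = 1; i < ranges.length; i++) {
            if (ranges[i].start <= cur.end) {
                // Touching or overlapping: extend the current range.
                cur.end = Math.max(cur.end, ranges[i].end);
            } else {
                merged.push(cur);
                cur = { type: type, start: ranges[i].start, end: ranges[i].end };
            }
        }
        merged.push(cur);
    });
    return merged;
}

// Parser output for <b>bb<i>bbii</i></b><i>ii</i> over "bbbbiiii":
var split = [
    { type: 'bold', start: 0, end: 6 },
    { type: 'italic', start: 2, end: 6 },
    { type: 'italic', start: 6, end: 8 }
];
// mergeAdjacent(split) restores the original overlap:
// bold [0,6) and italic [2,8).
```

This only recovers overlaps that the parser split at a shared boundary; annotations carrying distinct attributes would need an extra equality check before merging.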
IMO we should be shooting for server APIs that give
users very clean
data structures, so they can transform them however they like. HTML
should be just one of the output formats.
I completely agree -- on the server side, higher-level parsing into a
suitable tree (DOM or else) or corresponding SAX events would be
performed by a (possibly modified) HTML5 parser. The output of this
parser is in no way limited to HTML.
Gabriel
Example in FF 4+:
>> document.body.innerHTML = "<b data-x='y'>bb<i>bbii</b>ii</i>"
"<b data-x='y'>bb<i>bbii</b>ii</i>"
>> document.body.innerHTML
"<b data-x="y">bb<i>bbii</i></b><i>ii</i>"