Good evening,
this week I looked at different ways of cajoling overlapping, improperly nested, or otherwise horrible but real-life wiki content into the WikiDom structure for consumption by the visual editor currently in development. So far, MediaWiki delegates the sanitization of those horrors to HTML Tidy, which employs (mostly) good heuristics to make sense of its input.
The [HTML5] spec finally standardized parsing and error recovery for HTML, which seems to overlap widely with what we need for the new parser (how far?). An open-source reference implementation of the parser spec is available in Java [VNU], which is also translated to C++ and, via GWT, to JavaScript (http://livedom.validator.nu/); PHP and Python ports are available at [HLib]. Modern browsers have similar implementations built in.
The reference parsers all use a relatively simple tokenizer in combination with a mostly switch-based parser / tree builder that constructs a cleaned-up DOM from the token stream. Tags are balanced and matched using a random-access stack, with a separate list of open formatting elements (very similar to the annotations in WikiDom). For each combination of parsing context and token, an error recovery strategy can be specified directly in a switch case.
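To make the shape of this concrete, here is a very rough sketch of such a switch-based tree builder. The token names, the node shape, and the recovery cases are all made up for illustration; the real spec algorithm has insertion modes, the adoption agency algorithm, and many more cases.

var FORMATTING = { b: true, i: true, u: true, small: true };

function buildTree(tokens) {
  var root = { type: 'root', children: [] };
  var openElements = [root];   // random-access stack of open elements
  var activeFormatting = [];   // list of open formatting elements; this is where
                               // reopening across block boundaries (the "adoption
                               // agency" step) would hook in -- elided here

  function current() { return openElements[openElements.length - 1]; }

  tokens.forEach(function (token) {
    switch (token.type) {
      case 'text':
        current().children.push({ type: 'text', value: token.value });
        break;
      case 'startTag':
        var node = { type: token.name, children: [] };
        current().children.push(node);
        openElements.push(node);
        if (FORMATTING[token.name]) { activeFormatting.push(node); }
        break;
      case 'endTag':
        // Error recovery: pop back to the matching open element if there is one;
        // a stray end tag with no matching start tag is simply ignored.
        for (var i = openElements.length - 1; i > 0; i--) {
          if (openElements[i].type === token.name) {
            openElements.length = i;
            break;
          }
        }
        break;
      default:
        // The real spec defines an explicit case for every context/token pair.
        break;
    }
  });
  return root;
}

// buildTree([
//   { type: 'startTag', name: 'b' },
//   { type: 'text', value: 'bold' },
//   { type: 'endTag', name: 'i' },   // stray </i>: ignored
//   { type: 'endTag', name: 'b' }
// ]);
// -> { type: 'root', children: [ { type: 'b', children: [ { type: 'text', value: 'bold' } ] } ] }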
The strength of this strategy is clearly the ease of implementing error recovery. The big disadvantage is the absence of a nicely declarative grammar, except perhaps a shallow one for the tokenizer. (Is there actually an example of a parser with serious HTML-like error recovery and an elegant grammar?)
In our specific visual editor application, performing a full error recovery / clean-up while constructing the WikiDom is at odds with the desire to round-trip wiki source. Performing full sanitation only in the HTML serializer while doing none in the Wikitext serializer seems to be a better fit. The WikiDom design with its support for overlapping annotations allows the omission of most early sanitation for inline elements. Block-level constructs, however, still need to be fully parsed so that implicit scopes of inline elements can be determined (e.g., limiting the range of annotations to table cells) and a DOM tree can be built. This tree then allows the visual editor to present some sensible, editable outline of the document.
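As a deliberately simplified illustration of the implicit-scope point (the annotation shape below -- a type plus a character range over a cell's text -- is made up and not the real WikiDom format): an annotation still open when the tree builder leaves a table cell would be closed implicitly at the cell boundary instead of leaking into the next cell.

// Hypothetical open annotation: started at offset 5, no end seen yet.
function closeOpenAnnotationsAtCellEnd(openAnnotations, cellTextLength) {
  return openAnnotations.map(function (a) {
    return {
      type: a.type,
      range: {
        start: a.start,
        // still open when the cell ends: close it implicitly at the boundary
        end: a.end === undefined ? cellTextLength : Math.min(a.end, cellTextLength)
      }
    };
  });
}

// closeOpenAnnotationsAtCellEnd([{ type: 'textStyle/bold', start: 5 }], 8);
// -> [{ type: 'textStyle/bold', range: { start: 5, end: 8 } }]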
A possible implementation could use a simplified version of the current PEG parser mostly as a combined wiki and HTML tokenizer, which feeds a token stream to a parser / tree builder modeled on the HTML5 parsers. Separating the sanitation of inline and block-level elements to minimize early sanitation seems to be quite doable.
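Very roughly, the wiring could look like the sketch below. The tokenizer is just a trivial stand-in for the PEG-generated one and only knows about ''' and bare HTML tags, so treat every name here as a placeholder.

function tokenizeWikitext(src) {
  var tokens = [];
  var boldOpen = false;
  src.split(/('''|<\/?[a-z]+>)/).forEach(function (chunk) {
    if (chunk === '') { return; }
    if (chunk === "'''") {
      // grossly simplified: real quote handling is context-dependent
      tokens.push({ type: boldOpen ? 'endTag' : 'startTag', name: 'b' });
      boldOpen = !boldOpen;
    } else if (/^<\//.test(chunk)) {
      tokens.push({ type: 'endTag', name: chunk.slice(2, -1) });
    } else if (/^</.test(chunk)) {
      tokens.push({ type: 'startTag', name: chunk.slice(1, -1) });
    } else {
      tokens.push({ type: 'text', value: chunk });
    }
  });
  return tokens;
}

// The flat token stream then goes into an HTML5-style tree builder like the
// buildTree() sketch above:
// var tree = buildTree(tokenizeWikitext("'''bold''' and <i>italic</i> text"));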
What do you think about this general direction of building on HTML parsers? Where should a wiki parser differ in its error recovery strategy? How important is having a full grammar?
Gabriel
[HTML5] Parsing spec: http://dev.w3.org/html5/spec/Overview.html#parsing
[VNU] Ref. impl. (Java, C++, JS): http://about.validator.nu/htmlparser/
      Live JS parser demo: http://livedom.validator.nu/
[HLib] PHP and Python parsers: http://code.google.com/p/html5lib/
+1 on doing HTML and Wikitext in the same parser, only because I've found that it is necessary, in my limited experience doing it in JS.
I'm not knowledgeable enough about the HTML5 error recovery spec to comment. I don't know of any other models for "recovery" in parsers out there, other than our own. I don't know how you would find out if the HTML5 way is appropriate for us other than trying it. Since it seems to point the way towards a more understandable means of normalizing wikitext, I would vote for it, but I'm voting from a position of relative ignorance.
Should we have a formal grammar? Let's be pragmatic -- a formal grammar is a means to a couple of ends as far as I see it.
1 - to easily have equivalent parsers in PHP and JS, and to allow the community to help develop it in an interactive way a la ParserPlayground.
This is not an either-or thing. If the parser is MOSTLY formal, that's good enough. But we should still be shooting for like 97% of the cases to be handled by the grammar.
2 - to give others a way to parse wikitext better.
This may not be necessary. If our parser can produce a nice abstract syntax tree at some point, the API can just emit some other regular format for people to use, perhaps XML or JSON based. WikiDom is more optimized for the editor, but it's probably also good for this purpose.
Then *maybe* one day we can transition to this more regular format, but that's a decision we'll probably face in 2013, if ever.
It seems to me that a rough grammar plus an extensive test suite to verify the correctness of any parser is a much bigger win. Especially with story-based tests, you end up with something that helps you write a parser and validate it at the same time.
Such a suite can also be used to validate our own parser.
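For example (everything below -- the case format, the expected HTML, and the runner -- is made up for illustration, not an existing suite): each story pairs wikitext with the output we expect, and the same table can then be run against both the new parser and the current PHP parser via the API.

var cases = [
  {
    story: 'unclosed bold is recovered',
    wikitext: "'''bold text",
    expected: '<p><b>bold text</b></p>'   // made-up expectation
  },
  {
    story: 'italic does not leak out of a table cell',
    wikitext: "{|\n| ''italic\n| plain\n|}",
    expected: '<table><tbody><tr><td><i>italic</i></td><td>plain</td></tr></tbody></table>'   // made-up
  }
];

function runCases(parse) {
  return cases.map(function (c) {
    var actual = parse(c.wikitext);
    return { story: c.story, pass: actual === c.expected, actual: actual };
  });
}

// runCases(newParser.parse) and runCases(phpParserViaApi) -- both hypothetical
// functions -- should eventually agree on every story.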
An XSD + formal token list (for text elements) + formal grammar seems most maintainable, explainable, and durable. Maybe put pages with exceptions on a flagged list queued for human review and repair. The fun part is the parsing during active edits. -Paul
> Should we have a formal grammar? Let's be pragmatic -- a formal grammar is a means to a couple of ends as far as I see it.
> 1 - to easily have equivalent parsers in PHP and JS, and to allow the community to help develop it in an interactive way a la ParserPlayground.
> This is not an either-or thing. If the parser is MOSTLY formal, that's good enough. But we should still be shooting for like 97% of the cases to be handled by the grammar.
97% of the context-free portions might be possible, but my feeling is that once you start pushing beyond what context-free grammars can directly do, the grammar quickly becomes really messy and hard to maintain or comprehend. The context-free portion covers most wiki syntax, but not larger-scale structures including HTML tags, because of overlapping markup.
Converting arbitrarily overlapped structures (or tag soup in general) to a sensible *tree* requires random-access stacks, and falls outside CFGs. Different strategies are possible in this space, with the HTML5 spec being one.
AFAICT there are no popular formalisms for automata with random-access stacks, so any standardization will probably look very much like the HTML5 spec: a discussion of all cases in prose. If the HTML5 spec turns out to be good enough, then we don't have to standardize that part of the parser ourselves, and implementations in different languages and browsers are already available, which would be good for portability.
> 2 - to give others a way to parse wikitext better.
> This may not be necessary. If our parser can produce a nice abstract syntax tree at some point, the API can just emit some other regular format for people to use, perhaps XML or JSON based. WikiDom is more optimized for the editor, but it's probably also good for this purpose.
Even an annotated HTML DOM (using the data-* attributes for example) could be used. We might actually be able to off-load most context-sensitive parts of the parsing process to the browser's HTML parser by feeding it pre-tokenized HTML tag soup, for example via .innerHTML.
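A rough browser-only sketch of what I mean; the data-wiki-src attribute and the output shape are invented for illustration:

function parseWithBrowser(tagSoup) {
  var container = document.createElement('div');
  container.innerHTML = tagSoup;   // the browser's HTML5 parser does the error recovery

  function walk(node) {
    if (node.nodeType !== 1) {     // text (and other non-element) nodes
      return { type: 'text', value: node.data || '' };
    }
    return {
      type: node.nodeName.toLowerCase(),
      // hypothetical round-trip payload attached by our tokenizer
      wikiSrc: node.getAttribute('data-wiki-src') || undefined,
      children: Array.prototype.map.call(node.childNodes, walk)
    };
  }

  return Array.prototype.map.call(container.childNodes, walk);
}

// parseWithBrowser("<b data-wiki-src=\"'''\">bb<i>bbii</b>ii</i>");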
Gabriel
This seems like a really smart way to go. Wikitext includes a large subset of HTML; working with that in our parser design rather than against it should make things easier, and should allow more of the spec (and some implementations) to just be "reference the HTML5 spec here, but restricted to values as XYZ".
Where we have structures that don't quite fit the HTML-style tree, this may still require us to have elements that, at that level, get represented by separate start and end tags rather than a single 'HTML element'.
-- brion
On 11/14/11 12:50 AM, Gabriel Wicke wrote:
>> 2 - to give others a way to parse wikitext better.
> Even an annotated HTML DOM (using the data-* attributes for example) could be used. We might actually be able to off-load most context-sensitive parts of the parsing process to the browser's HTML parser by feeding it pre-tokenized HTML tag soup, for example via .innerHTML.
I'm not sure what you are proposing -- are you suggesting that we let some anomalies persist and let the browser take care of it?
IMO we should be shooting for server APIs that give users very clean data structures, so they can transform them however they like. HTML should be just one of the output formats.
>> Even an annotated HTML DOM (using the data-* attributes for example) could be used. We might actually be able to off-load most context-sensitive parts of the parsing process to the browser's HTML parser by feeding it pre-tokenized HTML tag soup, for example via .innerHTML.
> I'm not sure what you are proposing -- are you suggesting that we let some anomalies persist and let the browser take care of it?
Yes and no ;) I was speculating on the possibility of using the built-in HTML5 parser of modern browsers to implement part of our in-browser parsing pipeline, especially for the visual editor. When we feed tag soup produced by a CFG-based tokenizer to a modern browser with an HTML5 parser (e.g., FF 4+) via .innerHTML, it will sanitize the input according to the HTML5 parser spec. If we then read .innerHTML back, we get a sanitized serialization (see the example at the end). But of course we could also just use the cleaned-up DOM fragment directly and walk it to build the WikiDom.
This is just an idea at this stage, and there might be more issues that sink it. In particular, preserving overlaps in annotations might be tricky: HTML5 parsers break overlapping ranges up into non-overlapping ones, so those would need to be merged back together when building the WikiDom. Alternatively, there are JavaScript libraries implementing the HTML5 parser spec which could be modified if plain HTML5 behavior is not ideal.
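A minimal sketch of that merge-back step, assuming annotations are flat objects with a type and a character range (an invented shape, not the real WikiDom one): ranges of the same type that touch or overlap -- as produced when the parser splits <b>bb<i>bbii</b>ii</i> apart -- are folded back into a single annotation.

function mergeAnnotations(annotations) {
  var merged = [];
  annotations
    .slice()
    .sort(function (a, b) {
      if (a.type !== b.type) { return a.type < b.type ? -1 : 1; }
      return a.range.start - b.range.start;
    })
    .forEach(function (a) {
      var last = merged[merged.length - 1];
      if (last && last.type === a.type && a.range.start <= last.range.end) {
        // touching or overlapping range of the same type: extend the last one
        last.range.end = Math.max(last.range.end, a.range.end);
      } else {
        merged.push({ type: a.type, range: { start: a.range.start, end: a.range.end } });
      }
    });
  return merged;
}

// The serialization <b>bb<i>bbii</i></b><i>ii</i> over the text "bbbbiiii"
// yields bold [0,6], italic [2,6] and italic [6,8]; the two italic pieces
// merge back into a single italic [2,8] annotation:
// mergeAnnotations([
//   { type: 'textStyle/bold',   range: { start: 0, end: 6 } },
//   { type: 'textStyle/italic', range: { start: 2, end: 6 } },
//   { type: 'textStyle/italic', range: { start: 6, end: 8 } }
// ]);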
> IMO we should be shooting for server APIs that give users very clean data structures, so they can transform them however they like. HTML should be just one of the output formats.
I completely agree -- on the server side, higher-level parsing into a suitable tree (DOM or otherwise) or corresponding SAX events would be performed by a (possibly modified) HTML5 parser. The output of this parser is in no way limited to HTML.
Gabriel
Example in FF 4+ (console input, with what each line evaluates to shown in the comments):

document.body.innerHTML = "<b data-x='y'>bb<i>bbii</b>ii</i>";
// the assignment echoes back the original string:
//   <b data-x='y'>bb<i>bbii</b>ii</i>
document.body.innerHTML;
// reading it back returns the sanitized serialization:
//   <b data-x="y">bb<i>bbii</i></b><i>ii</i>