It seems to me that a rough grammar plus an extensive test suite to
verify the correctness of any parser is a much bigger win. Especially
with story-based tests, you end up with something that helps you write a
parser and validate it at the same time.
It can also be used to validate our own parser.
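For example, a story-based test could pair some gnarly wikitext with the
output we expect, roughly in the spirit of the existing parserTests
format (the exact syntax below is just a sketch, not a proposal):

  !! test
  Unclosed bold inside a table cell
  !! input
  {|
  | '''bold cell
  |}
  !! result
  <table><tr><td><b>bold cell</b></td></tr></table>
  !! end

Cases like this double as documentation of the error recovery we intend,
independent of any particular implementation.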
On Sat, Nov 12, 2011 at 1:58 AM, Neil Kandalgaonkar <neilk(a)wikimedia.org> wrote:
+1 on doing HTML and Wikitext in the same parser, only because I've
found that it is necessary, in my limited experience doing it in JS.
I'm not knowledgeable enough about the HTML5 error recovery spec to
comment. I don't know of any other models for "recovery" in parsers out
there, other than our own. I don't know how you would find out if the
HTML5 way is appropriate for us other than trying it. Since it seems to
point the way towards a more understandable means of normalizing
wikitext, I would vote for it, but I'm voting from a position of
relative ignorance.
Should we have a formal grammar? Let's be pragmatic -- a formal grammar
is a means to a couple of ends as far as I see it.
1 - to easily have equivalent parsers in PHP and JS, and to allow the
community to help develop it in an interactive way a la ParserPlayground.
This is not an either-or thing. If the parser is MOSTLY formal, that's
good enough. But we should still be shooting for like 97% of the cases
to be handled by the grammar.
2 - to give others a way to parse wikitext better.
This may not be necessary. If our parser can produce a nice abstract
syntax tree at some point, the API can just emit some other regular
format for people to use, perhaps XML or JSON based. Wikidom is more
optimized for the editor, but it's probably also good for this purpose.
Then *maybe* one day we can transition to this more regular format, but
that's a decision we'll probably face in 2013, if ever.
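To make that concrete (the structure below is entirely made up, not a
proposal for the actual format), a JSON emission for a simple paragraph
could look something like:

  {
    "type": "paragraph",
    "children": [
      { "type": "text", "value": "Visit " },
      { "type": "internal-link", "target": "Main Page",
        "children": [ { "type": "text", "value": "the main page" } ] },
      { "type": "text", "value": " for details." }
    ]
  }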
On 11/11/11 3:57 PM, Gabriel Wicke wrote:
Good evening,
this week I looked at different ways of cajoling overlapping, improperly
nested or otherwise horrible but real-life wiki content into the WikiDom
structure for consumption by the visual editor currently in development.
So far, MediaWiki delegates the sanitization of those horrors to HTML
Tidy, which employs (mostly) good heuristics to make sense of its input.
The [HTML5] spec finally standardized parsing and error recovery for
HTML, which seems to overlap widely with what we need for the new parser
(how far?). Open-source reference implementations of the parser spec are
available in Java [VNU], which compiles to C++ and JavaScript
(http://livedom.validator.nu/) through GWT, with PHP and Python ports at
[HLib]. Modern browsers have similar implementations built in.
The reference parsers all use a relatively simple tokenizer in
combination with a mostly switch-based parser / tree builder that
constructs a cleaned-up DOM from the token stream. Tags are balanced and
matched using a random-access stack of open elements, with a separate
list of active formatting elements (very similar to the annotations in
WikiDom). For each parsing context and token combination, an error
recovery strategy can be specified directly in a switch case.
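In (very rough, hypothetical) JavaScript, the tree-building side of such
a parser could look something like this -- the names are invented for
illustration, but the stack of open elements and the per-token switch
mirror what the HTML5 implementations do:

  // Sketch of an HTML5-style tree builder consuming a flat token stream.
  function TreeBuilder() {
    this.root = { type: 'root', children: [] };
    this.openElements = [this.root];   // random-access stack of open elements
    this.activeFormatting = [];        // list of active formatting elements
  }

  TreeBuilder.prototype.processToken = function (token) {
    var current = this.openElements[this.openElements.length - 1];
    switch (token.type) {
      case 'text':
        current.children.push({ type: 'text', value: token.value });
        break;
      case 'startTag':
        var node = { type: token.name, children: [] };
        current.children.push(node);
        this.openElements.push(node);
        if (token.name === 'b' || token.name === 'i') {
          // A full implementation would also reconstruct these across
          // block boundaries; omitted here.
          this.activeFormatting.push(node);
        }
        break;
      case 'endTag':
        // Error recovery hook: if the end tag does not match the current
        // element, pop until a match is found, or ignore the stray tag.
        var i = this.openElements.length - 1;
        while (i > 0 && this.openElements[i].type !== token.name) {
          i--;
        }
        if (i > 0) {
          this.openElements.length = i;   // implicitly close mismatched tags
        }
        break;
    }
  };

  TreeBuilder.prototype.getDocument = function () {
    return this.root;
  };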
The strength of this strategy is clearly the ease of implementing error
recovery. The big disadvantage is the absence of a nicely declarative
grammar, except perhaps a shallow one for the tokenizer. (Is there
actually an example of a parser with serious HTML-like error recovery
and an elegant grammar?)
In our specific visual editor application, performing a full error
recovery / clean-up while constructing the WikiDom is at odds with the
desire to round-trip wiki source. Performing full sanitization only in the
HTML serializer while doing none in the Wikitext serializer seems to be
a better fit. The WikiDom design with its support for overlapping
annotations allows the omission of most early sanitization for inline
elements. Block-level constructs however still need to be fully parsed
so that implicit scopes of inline elements can be determined (e.g.,
limiting the range of annotations to table cells) and a DOM tree can be
built. This tree then allows the visual editor to present some sensible,
editable outline of the document.
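As a purely hypothetical illustration of what I mean by limiting
annotation ranges (the structure below is made up, not actual WikiDom):
a bold annotation opened in a cell and never closed would, after
block-level parsing, simply be clamped to that cell's content rather
than leaking into the rest of the table:

  {
    "type": "tableCell",
    "content": {
      "text": "bold cell text",
      "annotations": [
        { "type": "bold", "range": { "start": 0, "end": 14 } }
      ]
    }
  }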
A possible implementation could use a simplified version of the current
PEG parser mostly as a combined wiki and HTML tokenizer, which feeds a
token stream to a parser / tree builder modeled on the HTML5 parsers.
Separating the sanitization of inline and block-level elements to
minimize early sanitization seems to be quite doable.
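The glue between the two stages could then be as simple as something
like this (again just a sketch with invented names, reusing the tree
builder idea from above):

  // Hypothetical pipeline: a PEG-based tokenizer emits a flat token
  // stream, and an HTML5-style tree builder turns it into a clean tree.
  var tokens = wikiTokenizer.tokenize(wikitext);   // e.g. [{ type: 'startTag', name: 'table' }, ...]
  var builder = new TreeBuilder();
  for (var i = 0; i < tokens.length; i++) {
    builder.processToken(tokens[i]);
  }
  var dom = builder.getDocument();                 // basis for WikiDom / HTML output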
What do you think about this general direction of building on HTML
parsers? Where should a wiki parser differ in its error recovery
strategy? How important is having a full grammar?
Gabriel
[HTML5] Parsing spec:
http://dev.w3.org/html5/spec/Overview.html#parsing
[VNU] Ref impl. (Java, C++, JS):
http://about.validator.nu/htmlparser/
Live JS parser demo:
http://livedom.validator.nu/
[HLib] PHP and Python parsers:
http://code.google.com/p/html5lib/
--
Neil Kandalgaonkar ( ) <neilk(a)wikimedia.org>