This seems like a really smart way to go. Wikitext includes a large subset
of HTML; working with that in our parser design rather than against it
should make things easier, and should allow making more of the spec (and
some implementations) just be "reference the HTML5 spec here, but
restricted to values as XYZ".
Where we have structures that don't quite fit the HTML-style tree, we may
still need elements that, at that level, are represented by separate start
and end tags rather than a single 'HTML element'.
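A minimal sketch of what that separate-tag representation could look like (the token format here is invented for illustration, not any actual parser's):

```python
# Sketch: the wikitext "''a '''b'' c'''" overlaps italic and bold, so no
# single well-nested element can represent it; a tokenizer can instead emit
# paired start/end tokens. (This token format is made up for illustration.)
tokens = [
    ("start", "i"), ("text", "a "),
    ("start", "b"), ("text", "b"),
    ("end", "i"),              # <i> closes while <b> is still open
    ("text", " c"), ("end", "b"),
]

def check_nesting(tokens):
    """Return True if the start/end tokens form a properly nested tree."""
    stack = []
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            if not stack or stack[-1] != value:
                return False   # end tag doesn't match innermost open element
            stack.pop()
    return not stack

print(check_nesting(tokens))   # → False: the stream is not a well-formed tree
```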
-- brion
On Mon, Nov 14, 2011 at 12:50 AM, Gabriel Wicke <wicke(a)wikidev.net> wrote:
>> Should we have a formal grammar? Let's be pragmatic -- a formal grammar
>> is a means to a couple of ends as far as I see it.
>> 1 - to easily have equivalent parsers in PHP and JS, and to allow the
>> community to help develop it in an interactive way a la ParserPlayground.
> This is not an either-or thing. If the parser is MOSTLY formal, that's
> good enough. But we should still be shooting for like 97% of the cases
> to be handled by the grammar.
97% of the context-free portions might be possible, but my feeling is
that once you start pushing what context-free grammars can directly do,
then the grammar quickly becomes really messy and hard to maintain or
comprehend. The context-free portion covers most wiki syntax, but not
larger-scale structures such as HTML tags, due to overlapping markup.
Converting arbitrarily overlapped structures (or tag soup in general) to
a sensible *tree* requires random-access stacks, and falls outside CFGs.
Different strategies are possible in this space, with the HTML5 spec
being one.
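To make that concrete, here is a toy stack-based repair pass in Python (my sketch, not the real HTML5 adoption-agency algorithm): on a mismatched end tag it closes the intervening open elements and reopens them afterwards, which is exactly the kind of stack surgery a context-free grammar cannot express.

```python
# Toy "close and reopen" repair for overlapping inline tags, loosely in the
# spirit of the HTML5 tree-construction rules. Not the actual algorithm.
def repair(tokens):
    out, stack = [], []               # stack holds currently open tag names
    for kind, name in tokens:
        if kind == "start":
            stack.append(name)
            out.append(("start", name))
        elif kind == "text":
            out.append((kind, name))
        else:                         # end tag
            if name not in stack:
                continue              # stray end tag: drop it
            reopen = []
            while stack[-1] != name:  # close intervening elements...
                t = stack.pop()
                out.append(("end", t))
                reopen.append(t)
            stack.pop()
            out.append(("end", name))
            for t in reversed(reopen):  # ...then reopen them
                stack.append(t)
                out.append(("start", t))
    while stack:                      # close anything left open at the end
        out.append(("end", stack.pop()))
    return out

# "''a'''b''c'''" style overlap: </i> arrives while <b> is still open.
tokens = [("start", "i"), ("text", "a"), ("start", "b"),
          ("text", "b"), ("end", "i"), ("text", "c"), ("end", "b")]
# repair(tokens) yields the tree <i>a<b>b</b></i><b>c</b>
```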
AFAICT there are no popular formalisms for automata with random-access
stacks, so any standardization will probably look very much like the
HTML5 spec: a discussion of all cases in prose. If the HTML5 spec turns
out to be good enough, then we don't have to standardize that part of
the parser ourselves, and implementations in different languages and
browsers are already available, which would be good for portability.
>> 2 - to give others a way to parse wikitext better.
This may not be necessary. If our parser can produce a nice abstract
syntax tree at some point, the API can just emit some other regular
format for people to use, perhaps XML or JSON based. Wikidom is more
optimized for the editor, but it's probably also good for this purpose.
Even an annotated HTML DOM (using the data-* attributes for example)
could be used. We might actually be able to off-load most
context-sensitive parts of the parsing process to the browser's HTML
parser by feeding it pre-tokenized HTML tag soup, for example via
.innerHTML.
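As a tiny illustration of the annotated-DOM idea, source information could ride along on `data-*` attributes and be read back off the tree. The attribute name `data-wt-src` below is hypothetical, not a settled convention, and the sketch uses Python's XML parser, so unlike a browser's forgiving HTML parser it needs well-formed input:

```python
# Sketch: carrying original wikitext source on the tree via data-* attributes.
# "data-wt-src" is an invented attribute name, used here only for illustration.
import xml.etree.ElementTree as ET

annotated = '<p><b data-wt-src="\'\'\'">bold</b> and plain text</p>'
tree = ET.fromstring(annotated)

b = tree.find('b')
print(b.get('data-wt-src'))   # → '''   (the original wikitext delimiter)
print(b.text)                 # → bold
```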
Gabriel
_______________________________________________
Wikitext-l mailing list
Wikitext-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitext-l