This seems like a really smart way to go. Wikitext includes a large subset of HTML; working with that in our parser design rather than against it should make things easier, and should let more of the spec (and some implementations) simply be "reference the HTML5 spec here, but restricted to values XYZ".

Where we have structures that don't quite fit the HTML-style tree, this may still require us to have elements that, at that level, are represented by separate start and end tags rather than a single 'HTML element'.

-- brion


On Mon, Nov 14, 2011 at 12:50 AM, Gabriel Wicke <wicke@wikidev.net> wrote:
> Should we have a formal grammar? Let's be pragmatic -- a formal grammar
> is a means to a couple of ends as far as I see it.
>
> 1 - to easily have equivalent parsers in PHP and JS, and to allow the
> community to help develop it in an interactive way a la ParserPlayground.
>
> This is not an either-or thing. If the parser is MOSTLY formal, that's
> good enough. But we should still be shooting for like 97% of the cases
> to be handled by the grammar.

97% of the context-free portions might be possible, but my feeling is
that once you start pushing the limits of what context-free grammars can
directly express, the grammar quickly becomes messy and hard to maintain
or comprehend. The context-free portion covers most wiki syntax, but not
larger-scale structures such as HTML tags, because that markup can
overlap.

Converting arbitrarily overlapped structures (or tag soup in general)
into a sensible *tree* requires random-access stacks, and so falls
outside what context-free grammars can express. Different strategies are
possible in this space, the HTML5 tree-construction algorithm being one.
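As a rough illustration (a toy sketch in Python, not any real parser's algorithm), here is the kind of stack juggling that turning overlapping inline tags into a well-formed tree requires: when a close tag doesn't match the innermost open element, the intervening elements have to be closed and then reopened. The full HTML5 tree-construction algorithm (the "adoption agency" rules) handles this far more carefully.

```python
import re

def repair_soup(tokens):
    """Rebalance overlapping inline tags into well-formed HTML.

    tokens is a flat list like ['<b>', 'one', '<i>', 'two', '</b>',
    'three', '</i>'].  This is only a toy stand-in for HTML5 tree
    construction: a mismatched close tag forces intervening elements
    to be closed and reopened, which needs a stack, not a CFG rule.
    """
    stack, out = [], []
    for tok in tokens:
        m = re.fullmatch(r'</?([a-z]+)>', tok)
        if m is None:                        # plain text
            out.append(tok)
        elif not tok.startswith('</'):       # open tag
            stack.append(m.group(1))
            out.append(tok)
        elif m.group(1) in stack:            # close tag with a match
            reopen = []
            while stack[-1] != m.group(1):   # close intervening tags...
                reopen.append(stack.pop())
                out.append('</%s>' % reopen[-1])
            out.append('</%s>' % stack.pop())
            for name in reversed(reopen):    # ...then reopen them
                stack.append(name)
                out.append('<%s>' % name)
        # else: stray close tag with no open element -- drop it
    for name in reversed(stack):             # close anything left open
        out.append('</%s>' % name)
    return ''.join(out)

print(repair_soup(['<b>', 'one', '<i>', 'two', '</b>', 'three', '</i>']))
# -> <b>one<i>two</i></b><i>three</i>
```

The point is only that the bookkeeping is inherently stateful: "three" ends up inside a freshly reopened <i> element, which a context-free production has no natural way to generate.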

AFAICT there are no popular formalisms for automata with random-access
stacks, so any standardization effort would probably end up looking very
much like the HTML5 spec: a discussion of all the cases in prose. If the
HTML5 spec turns out to be good enough, then we don't have to
standardize that part of the parser ourselves, and implementations in
different languages and browsers are already available, which would be
good for portability.

> 2 - to give others a way to parse wikitext better.
>
> This may not be necessary. If our parser can produce a nice abstract
> syntax tree at some point, the API can just emit some other regular
> format for people to use, perhaps XML or JSON based. Wikidom is more
> optimized for the editor, but it's probably also good for this purpose.

Even an annotated HTML DOM (using the data-* attributes for example)
could be used. We might actually be able to off-load most
context-sensitive parts of the parsing process to the browser's HTML
parser by feeding it pre-tokenized HTML tag soup, for example via
.innerHTML.
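As a sketch of what such an annotated DOM might look like (the data-mw-* attribute name below is invented for illustration, not a proposed format), the rendered output of a template could carry its original wikitext along in the tree:

```html
<!-- Hypothetical: data-mw-source is an invented attribute name. -->
<span data-mw-source="{{example-template|arg}}">expanded output here</span>
```

An editor or serializer could then recover the original wikitext for an unedited node straight from the attribute, with no separate tree format needed.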

Gabriel


_______________________________________________
Wikitext-l mailing list
Wikitext-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitext-l