This seems like a really smart way to go. Wikitext includes a large subset
of HTML; working with that in our parser design rather than against it
should make things easier, and should allow making more of the spec (and
some implementations) just be "reference the HTML5 spec here, but
restricted to values as XYZ".
Where we have structures that don't quite fit the HTML-style tree, we may
still need elements that, at that level, are represented by separate start
and end tags rather than a single 'HTML element'.
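A minimal sketch of what that separate-tag representation could look like (the token format here is invented for illustration, not any actual parser's):

```python
# Sketch: the wikitext "''a '''b'' c'''" overlaps italic and bold, so no
# single well-nested element can represent it; a tokenizer can instead emit
# paired start/end tokens. (This token format is made up for illustration.)
tokens = [
    ("start", "i"), ("text", "a "),
    ("start", "b"), ("text", "b"),
    ("end", "i"),              # <i> closes while <b> is still open
    ("text", " c"), ("end", "b"),
]

def check_nesting(tokens):
    """Return True if the start/end tokens form a properly nested tree."""
    stack = []
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            if not stack or stack[-1] != value:
                return False   # end tag doesn't match innermost open element
            stack.pop()
    return not stack

print(check_nesting(tokens))   # → False: the stream is not a well-formed tree
```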
-- brion
On Mon, Nov 14, 2011 at 12:50 AM, Gabriel Wicke <wicke(a)wikidev.net> wrote:
>> Should we have a formal grammar? Let's be pragmatic -- a formal grammar
>> is a means to a couple of ends as far as I see it.
>> 1 - to easily have equivalent parsers in PHP and JS, and to allow the
>> community to help develop it in an interactive way a la ParserPlayground.
> This is not an either-or thing. If the parser is MOSTLY formal, that's
> good enough. But we should still be shooting for like 97% of the cases
> to be handled by the grammar.
97% of the context-free portions might be possible, but my feeling is
that once you start pushing what context-free grammars can directly do,
then the grammar quickly becomes really messy and hard to maintain or
comprehend. The context-free portion covers most wiki syntax, but not
larger-scale structures such as HTML tags, due to overlapping markup.
Converting arbitrarily overlapped structures (or tag soup in general) to
a sensible *tree* requires random-access stacks, and falls outside CFGs.
Different strategies are possible in this space, with the HTML5 spec
being one.
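To make that concrete, here is a toy stack-based repair pass in Python (my sketch, not the real HTML5 adoption-agency algorithm): on a mismatched end tag it closes the intervening open elements and reopens them afterwards, which is exactly the kind of stack surgery a context-free grammar cannot express.

```python
# Toy "close and reopen" repair for overlapping inline tags, loosely in the
# spirit of the HTML5 tree-construction rules. Not the actual algorithm.
def repair(tokens):
    out, stack = [], []               # stack holds currently open tag names
    for kind, name in tokens:
        if kind == "start":
            stack.append(name)
            out.append(("start", name))
        elif kind == "text":
            out.append((kind, name))
        else:                         # end tag
            if name not in stack:
                continue              # stray end tag: drop it
            reopen = []
            while stack[-1] != name:  # close intervening elements...
                t = stack.pop()
                out.append(("end", t))
                reopen.append(t)
            stack.pop()
            out.append(("end", name))
            for t in reversed(reopen):  # ...then reopen them
                stack.append(t)
                out.append(("start", t))
    while stack:                      # close anything left open at the end
        out.append(("end", stack.pop()))
    return out

# "''a'''b''c'''" style overlap: </i> arrives while <b> is still open.
tokens = [("start", "i"), ("text", "a"), ("start", "b"),
          ("text", "b"), ("end", "i"), ("text", "c"), ("end", "b")]
# repair(tokens) yields the tree <i>a<b>b</b></i><b>c</b>
```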
AFAICT there are no popular formalisms for automata with random-access
stacks, so any standardization will probably look very much like the
HTML5 spec: a discussion of all cases in prose. If the HTML5 spec turns
out to be good enough, then we don't have to standardize that part of
the parser ourselves, and implementations in different languages and
browsers are already available, which would be good for portability.
>> 2 - to give others a way to parse wikitext better.
This may not be necessary. If our parser can produce a nice abstract
syntax tree at some point, the API can just emit some other regular
format for people to use, perhaps XML or JSON based. Wikidom is more
optimized for the editor, but it's probably also good for this purpose.
Even an annotated HTML DOM (using the data-* attributes for example)
could be used. We might actually be able to off-load most
context-sensitive parts of the parsing process to the browser's HTML
parser by feeding it pre-tokenized HTML tag soup, for example via
.innerHTML.
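As a tiny illustration of the annotated-DOM idea, source information could ride along on `data-*` attributes and be read back off the tree. The attribute name `data-wt-src` below is hypothetical, not a settled convention, and the sketch uses Python's XML parser, so unlike a browser's forgiving HTML parser it needs well-formed input:

```python
# Sketch: carrying original wikitext source on the tree via data-* attributes.
# "data-wt-src" is an invented attribute name, used here only for illustration.
import xml.etree.ElementTree as ET

annotated = '<p><b data-wt-src="\'\'\'">bold</b> and plain text</p>'
tree = ET.fromstring(annotated)

b = tree.find('b')
print(b.get('data-wt-src'))   # → '''   (the original wikitext delimiter)
print(b.text)                 # → bold
```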
Gabriel
_______________________________________________
Wikitext-l mailing list
Wikitext-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitext-l