Hello,
so far the HTML5 parser integration seems to have turned out quite well. 180 parser tests are now passing, and most of the remaining ones are about missing functionality in later stages of the parser pipeline.
Since this week, the produced HTML DOM can also be converted to WikiDom (or close to it). A sample result of the [[en:Barack Obama]] article is available at http://dev.wikidev.net/gabriel/tmp/obama.wikidom.txt. The unoptimized parse to WikiDom, without template expansions etc., currently takes about 35 seconds on my laptop.
The various moving parts of the setup (and how to try it out) are described in https://www.mediawiki.org/wiki/Future/Parser_development. In glorious ASCII, it might look roughly like this:
PEG wiki/HTML tokenizer (could also be any SAX-style parser)
  -> Token stream transformations
  -> HTML5 tree builder
  -> HTML DOM tree
  -> DOM Postprocessors
       +-> (X)HTML
       +-> DOMConverter -> WikiDom -> Visual Editor
The tokenizer is built from a completely static grammar and leaves all configuration-dependent behavior to later stages. Most of the interesting bits happen in the token stream transformations, which are dispatched by token type through a registration mechanism. The order of handlers can be specified, and early handlers can abort further processing for a token. Syntax-specific transformations can register for early processing, so that later transformations operate on a normalized version of the token. MediaWiki's special quote handling for italic/bold, for example, is implemented in a core extension that registers handlers for 'quote', 'newline' and the special 'eof' token. Lists and a simple version of the Cite extension are implemented similarly. A general emulation of parser hook behavior on top of the token stream is quite straightforward: both the tokens collected between tags and the plain text between them (recovered from the source positions noted in the tokens) are available.
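To make the registration mechanism a bit more concrete, here is a minimal sketch in Python. It is not the actual dispatcher code: the class name, the rank parameter, the (token, keep_going) return convention and the three example handlers are all assumptions made up for illustration. It only shows ordered handlers per token type, with an early handler able to abort further processing.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Token:
    type: str
    attrs: dict = field(default_factory=dict)


class TokenTransformDispatcher:
    """Hypothetical dispatcher: handlers register per token type with a rank."""

    def __init__(self):
        self._handlers = defaultdict(list)  # token type -> [(rank, handler)]

    def register(self, token_type, handler, rank=50):
        self._handlers[token_type].append((rank, handler))
        # Lower rank runs earlier; the sort is stable, so ties keep
        # their registration order.
        self._handlers[token_type].sort(key=lambda entry: entry[0])

    def transform(self, tokens):
        out = []
        for token in tokens:
            # Handlers are looked up once, by the token's original type.
            for _rank, handler in self._handlers.get(token.type, []):
                # Assumed convention: a handler returns (token, keep_going).
                token, keep_going = handler(token)
                if not keep_going:
                    break  # an early handler aborted further processing
            if token is not None:
                out.append(token)
        return out


def normalize_name(token):
    # Early, syntax-specific step: later handlers see a normalized token.
    token.attrs["name"] = token.attrs.get("name", "").lower()
    return token, True


def drop_empty(token):
    # Drops tokens without a name and stops the handler chain for them.
    if not token.attrs.get("name"):
        return None, False
    return token, True


def mark_seen(token):
    token.attrs["seen"] = True
    return token, True


dispatcher = TokenTransformDispatcher()
dispatcher.register("tag", mark_seen, rank=90)
dispatcher.register("tag", normalize_name, rank=10)
dispatcher.register("tag", drop_empty, rank=20)

result = dispatcher.transform([
    Token("tag", {"name": "DIV"}),
    Token("tag", {"name": ""}),
    Token("text", {"value": "hi"}),
])
```

Even though mark_seen is registered first, it runs last because of its rank, and it never sees the empty tag because drop_empty aborts the chain for that token.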
The token transform dispatcher class is prepared for asynchronous processing of tokens; this is already used, so far in a synchronous fashion, for the back-reference behavior of the italic/bold extension. The ability to overlap operations on multiple tokens will be very important for template expansions. Doing template expansions on the token level makes it possible to render unbalanced templates, like the table start / row / end combinations, for viewing, while encapsulating them when the output is destined for the visual editor. Template expansion is still work in progress.
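As a rough illustration of why overlapping matters for template expansion, the following Python asyncio sketch expands several template tokens concurrently while keeping the output in source order. Everything in it is assumed for the example: tokens are plain (kind, value) tuples, expand_template is a stand-in with a made-up delay and result, and asyncio is simply a convenient way to show the idea, not how the dispatcher is implemented.

```python
import asyncio


async def expand_template(name):
    # Stand-in for a slow fetch-and-expand step; delay and output are made up.
    await asyncio.sleep(0.01)
    return [("text", f"<expansion of {name}>")]


async def transform_token(token):
    kind, value = token
    if kind == "template":
        return await expand_template(value)
    return [token]  # non-template tokens pass through unchanged


async def transform_stream(tokens):
    # gather() runs the per-token coroutines concurrently and returns their
    # results in input order, so overlapping expansions cannot reorder output.
    chunks = await asyncio.gather(*(transform_token(t) for t in tokens))
    return [tok for chunk in chunks for tok in chunk]


tokens = [
    ("text", "a"),
    ("template", "foo"),
    ("text", "b"),
    ("template", "bar"),
]
expanded = asyncio.run(transform_stream(tokens))
```

Both expansions wait at the same time, so the total delay is roughly that of one expansion rather than the sum of the two.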
So much for now, looking forward to your thoughts!
Gabriel