On 12/13/2012 06:43 AM, Marco Fleckinger wrote:
> Implementing this is not very easy, but developers may be able to reuse some of the old ideas. Parsing in the other direction has to be built from scratch, but it is easier because everything is in a tree rather than in a single text string.
> Neither deserializing nor serializing involves any user interface, so testing could be automated quite easily by comparing the results of the conventional and the new parser. The result of the serialization can be compared with the original markup.
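
A minimal sketch of that round-trip check, where parse and serialize are hypothetical placeholders for a bidirectional parser/serializer pair rather than any real API:

type Tree = unknown;

// Hypothetical stand-ins for a bidirectional parser/serializer pair.
type Parse = (wikitext: string) => Tree;
type Serialize = (tree: Tree) => string;

// Parse the original markup into a tree, serialize the tree back
// out, and compare the result with the original. No user interface
// is involved, so this can run unattended over a corpus of pages.
function roundTripOk(
  original: string,
  parse: Parse,
  serialize: Serialize,
): boolean {
  return serialize(parse(original)) === original;
}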
Hi Marco,
we (the Parsoid team) have been working on many of the things you describe over the last year:
* We wrote a new bidirectional parser/serializer - see http://www.mediawiki.org/wiki/Parsoid. This includes a grammar-based tokenizer, async/parallel token stream transformations, and HTML5 DOM building (a rough sketch of this pipeline shape follows the list below).
* We developed an HTML5 / RDFa document model spec at http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec.
* Our parserTests runner tests wt2html (wikitext to HTML), wt2wt, html2html and html2wt modes with the same wikitext / HTML pairs as used in the PHP parser tests. We have roughly doubled the number of such pairs in the process.
* Automated and distributed round-trip tests are currently run over a random selection of 100k English Wikipedia pages: http://parsoid.wmflabs.org:8001/. This test infrastructure can easily be pointed at a different set of pages or another wiki.
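
As a much-simplified illustration of the pipeline shape mentioned in the first bullet, with the token and transformer types invented for illustration rather than taken from Parsoid's actual interfaces:

// Invented token shape; the real token types are much richer.
interface Token {
  type: "text" | "tag";
  value: string;
}

// Each transformer may rewrite, drop, or expand tokens, possibly
// asynchronously (e.g. template expansion may require fetches).
type TokenTransformer = (tokens: Token[]) => Promise<Token[]>;

// Run a token stream through a chain of transformers; a DOM
// builder would then consume the final stream.
async function runPipeline(
  tokens: Token[],
  transformers: TokenTransformer[],
): Promise<Token[]> {
  let stream = tokens;
  for (const transform of transformers) {
    stream = await transform(stream);
  }
  return stream;
}

// Example: a trivial transformer that trims plain text tokens.
const trim: TokenTransformer = async (ts) =>
  ts.map((t) => (t.type === "text" ? { ...t, value: t.value.trim() } : t));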
Parsoid is by no means complete, but we are very happy with how far we have already come since last October.
Cheers,
Gabriel