For context: I've been working on replacing the html5 and jsdom modules (which depend on the native 'contextify' module) with the pure-JavaScript 'domino' implementation of DOM4. This seems to be faster and cleaner, and it fixes some bugs caused by jsdom's eccentric DOM handling. Domino is (in my brief experience) more reliable and standards-compliant.
Here's a list of issues I came across in the process:
* There were 3 new failures in wt2html tests. (There were also some new passes, so the number of correct tests increases on net.) They are:
1) "expansion of multi-line templates in attribute values (bug 6255 sanity check 2)" For reference, this test looks like:
!! test
Expansion of multi-line templates in attribute values (bug 6255 sanity check)
!! input
<div style="background: #00FF00">-</div>
!! result
<div style="background: #00FF00">-</div>
!! end

!! test
Expansion of multi-line templates in attribute values (bug 6255 sanity check 2)
!! input
<div style="background: #00FF00">-</div>
!! result
<div style="background: #00FF00">-</div>
!! end
I'm not sure how this test ever passed in jsdom -- the inputs here are actually identical to an HTML parser, since hex-escape decoding happens very early. But apparently the wikitext parser should defer processing of the escaped newline somehow? On the domino branch our HTML serialization now uses the upstream standard HTML5 serialization algorithm, which doesn't escape newlines (http://www.whatwg.org/specs/web-apps/current-work/multipage/the-end.html#ser...).
Note that the first test also involves whitespace normalization, which the PHP parser does (see https://www.mediawiki.org/wiki/Special:Code/MediaWiki/14689) but parsoid does not. (I've got a patch to do whitespace normalization in parsoid if there's interest, but it causes other tests to break.)
What's the plan to handle cases like this? Is it really important to generate the escaped newline in the output?
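To make the serialization point concrete, here is a minimal sketch of attribute-value escaping as the HTML fragment serialization algorithm described it around the time of this thread: only `&`, U+00A0 and `"` are rewritten, so a newline in a `style` value is written out literally. The function names are hypothetical and this is not domino's actual code; later revisions of the spec may escape additional characters.

```javascript
// Sketch of attribute-mode escaping (assumption: the rules as specified at
// the time of this thread; not taken from domino's source).
function escapeAttributeValue(value) {
  return value
    .replace(/&/g, '&amp;')        // ampersand first, so later output is not re-escaped
    .replace(/\u00A0/g, '&nbsp;')  // no-break space always becomes the named reference
    .replace(/"/g, '&quot;');      // double quote, since the value is double-quoted
}

// Hypothetical helper that writes one name="value" pair.
function serializeAttribute(name, value) {
  return name + '="' + escapeAttributeValue(value) + '"';
}

// A value containing a real newline, as an HTML parser would hand it over
// after decoding a numeric character reference.
const serialized = serializeAttribute('style', 'background: \n#00FF00');
console.log(JSON.stringify(serialized));
```

Once the parser has decoded the reference, nothing in these rules can bring it back, which is why the expected output of "sanity check 2" cannot be reproduced by a standard serializer on its own.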
2) "Play a bit with r67090 and bug 3158" This is a parsoid-only test which looks like:
!! test
Play a bit with r67090 and bug 3158
!! options
disabled
!! input
<div style="width:50% !important"> </div>
<div style="width:50% !important"> </div>
<div style="width:50% !important"> </div>
<div style="border : solid;"> </div>
!! result
<div style="width:50% !important"> </div>
<div style="width:50% !important"> </div>
<div style="width:50% !important"> </div>
<div style="border : solid;"> </div>
!! end
In standard HTML serialization, a no-break space (U+00A0) is encoded uniformly as &nbsp;, so even if you wanted to be bug-compatible with the 'border :' style, you should be emitting &nbsp; there, not some other encoding of the character. The other two cases are whitespace normalization within attributes (again). I'm guessing jsdom (incorrectly) did this by default whether you wanted it or not; you need to explicitly add attribute normalization in the domino case if that's desired. (But there's some other reason why the 'border :' case is failing now which needs to be chased down, unrelated to the no-break-space encoding issue.)
3) "Parsoid-only: Table with broken attribute value quoting on consecutive lines"
!! test
Parsoid-only: Table with broken attribute value quoting on consecutive lines
!! options
disabled
!! input
{|
| title="Hello world|Foo
| style="color:red|Bar
|}
!! result
<table> <tr> <td title="Hello world">Foo </td><td style="color: red;">Bar </td></tr></table>
!! end
jsdom used to insert the extraneous semicolon at the end of the 'style' attribute. domino does not. I believe this test case is broken and the extraneous semicolon should be removed.
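Items 2 and 3 above both come down to attribute normalization that jsdom applied implicitly and domino does not. If whitespace normalization were wanted explicitly, a rough sketch might look like the following. The helper name is hypothetical, and the exact rule (collapse runs of ASCII whitespace to one space, then trim) is my assumption about what the PHP parser does, not something confirmed in this thread.

```javascript
// Hypothetical helper: collapse runs of ASCII whitespace inside an attribute
// value and trim the ends. The character class is spelled out on purpose:
// JavaScript's \s also matches U+00A0, which must survive untouched.
function normalizeAttributeWhitespace(value) {
  return value
    .replace(/[ \t\r\n]+/g, ' ')  // any run of space/tab/CR/LF becomes one space
    .replace(/^ | $/g, '');       // drop a single leading or trailing space
}

const normalized = normalizeAttributeWhitespace('background:\n#00FF00');
console.log(JSON.stringify(normalized));
```

The trailing semicolon from item 3 is a separate matter (CSS re-serialization rather than whitespace handling) and is deliberately not covered here.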
* Other observed bugs & failures: http://parsoid.wmflabs.org/en/Pi gives:
TypeError: Cannot assign to read only property 'ksrc' of #<KV>
    at AttributeExpander._returnAttributes (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:71:20)
    at AttributeTransformManager.process (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:1017:8)
    at AttributeExpander.onToken (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:46:7)
    at AsyncTokenTransformManager.transformTokens (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:568:17)
    at AsyncTokenTransformManager.onChunk (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:356:17)
    at SyncTokenTransformManager.EventEmitter.emit (events.js:96:17)
    at SyncTokenTransformManager.onChunk (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:904:7)
    at PegTokenizer.EventEmitter.emit (events.js:96:17)
    at PegTokenizer.process (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.tokenizer.peg.js:88:11)
    at ParserPipeline.process (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.parser.js:360:21)
http://localhost:8000/simple/Game gives:
starting parsing of Game
*********** ERROR: cs/s mismatch for node: A s: 3808; cs: 3821 ************
completed parsing of Game in 1491 ms
* [[File:]] tag parsing for images appears to be incomplete:
  a) alt= and class= are not parsed
  b) 'thumb' and 'right' should result in <img class="thumb tright" /> or some such, but there doesn't appear to be an indication of either option in the parsoid output.
* I'd like to see title and revision information in the <head>
* Interwiki links are not converted to relative links when the "interwiki" is actually the current wiki. (Maybe this isn't really a bug.)
Let's discuss these a bit and I'll file bugzilla tickets for the bits we can agree are actually bugs. ;)
--scott
Scott,
On 02/19/2013 03:50 PM, C. Scott Ananian wrote:
> <div style="background: #00FF00">-</div>
> What's the plan to handle cases like this? Is it really important to generate the escaped newline in the output?
The escaped newline is not important in wt2html mode, but should (ideally) be preserved in wt2wt mode. Client DOM implementations don't preserve the entity encoding, so we'd have to handle this with encoding-tolerant attribute shadowing in the serializer. Handling all attribute normalization cases explicitly would introduce a lot of complexity, however, so we prefer to hide these purely syntactic diffs in unmodified content with selective serialization.
> - "Play a bit with r67090 and bug 3158"
> This is a parsoid-only test which looks like:
> In standard HTML serialization, a no-break space (U+00A0) is encoded uniformly as &nbsp;, so even if you wanted to be bug-compatible with the 'border :' style, you should be emitting &nbsp; there, not some other encoding of the character. The other two cases are whitespace normalization within attributes (again). I'm guessing jsdom (incorrectly) did this by default whether you wanted it or not; you need to explicitly add attribute normalization in the domino case if that's desired.
No normalization in the DOM makes round-tripping easier, and should lead to the same DOM with HTML5-compliant parsers anyway.
> - "Parsoid-only: Table with broken attribute value quoting on consecutive lines"
> jsdom used to insert the extraneous semicolon at the end of the 'style' attribute. domino does not. I believe this test case is broken and the extraneous semicolon should be removed.
Same issue: CSS normalization in jsdom vs. none in domino.
The following are really bug reports.
> - Other observed bugs & failures:
> http://parsoid.wmflabs.org/en/Pi gives:
> TypeError: Cannot assign to read only property 'ksrc' of #<KV>
>     at AttributeExpander._returnAttributes (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:71:20)
>     at AttributeTransformManager.process (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:1017:8)
>     at AttributeExpander.onToken (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:46:7)
>     at AsyncTokenTransformManager.transformTokens (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:568:17)
>     at AsyncTokenTransformManager.onChunk (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:356:17)
>     at SyncTokenTransformManager.EventEmitter.emit (events.js:96:17)
>     at SyncTokenTransformManager.onChunk (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:904:7)
>     at PegTokenizer.EventEmitter.emit (events.js:96:17)
>     at PegTokenizer.process (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.tokenizer.peg.js:88:11)
>     at ParserPipeline.process (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.parser.js:360:21)
> http://localhost:8000/simple/Game gives:
> starting parsing of Game
> *********** ERROR: cs/s mismatch for node: A s: 3808; cs: 3821 ************
> completed parsing of Game in 1491 ms
> - [[File:]] tag parsing for images appears to be incomplete:
>   a) alt= and class= are not parsed
>   b) 'thumb' and 'right' should result in <img class="thumb tright" /> or some such, but there doesn't appear to be an indication of either option in the parsoid output.
> - I'd like to see title and revision information in the <head>
This is planned, but not yet implemented.
> - Interwiki links are not converted to relative links when the "interwiki" is actually the current wiki. (Maybe this isn't really a bug.)
The PHP parser handles these as internal links, so we should too. We just need to make sure that we preserve the prefix when round-tripping an unmodified link target. The target is already shadowed, so I think the prefix should already be preserved.
Gabriel
> - Other observed bugs & failures:
> http://parsoid.wmflabs.org/en/Pi gives:
> TypeError: Cannot assign to read only property 'ksrc' of #<KV>
>     at AttributeExpander._returnAttributes (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:71:20)
>     at AttributeTransformManager.process (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:1017:8)
>     at AttributeExpander.onToken (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/ext.core.AttributeExpander.js:46:7)
>     at AsyncTokenTransformManager.transformTokens (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:568:17)
>     at AsyncTokenTransformManager.onChunk (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:356:17)
>     at SyncTokenTransformManager.EventEmitter.emit (events.js:96:17)
>     at SyncTokenTransformManager.onChunk (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.TokenTransformManager.js:904:7)
>     at PegTokenizer.EventEmitter.emit (events.js:96:17)
>     at PegTokenizer.process (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.tokenizer.peg.js:88:11)
>     at ParserPipeline.process (/home/cananian/Projects/OLPC/Narrative/mediawiki/Parsoid/js/lib/mediawiki.parser.js:360:21)
Fix submitted in https://gerrit.wikimedia.org/r/#/c/50014/
Subbu.
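For readers unfamiliar with the error quoted above: "Cannot assign to read only property" is the TypeError that strict-mode JavaScript raises when code assigns to a non-writable property. The sketch below reproduces that class of error with a stand-in `KV` constructor; both the constructor and the way `ksrc` is made read-only are hypothetical and say nothing about the actual cause in Parsoid or the fix in the Gerrit change.

```javascript
// Hypothetical stand-in for a key/value attribute object; not Parsoid's KV.
function KV(k, v) {
  this.k = k;
  this.v = v;
}

const attr = new KV('title', 'Pi');

// Assumption for illustration: 'ksrc' was defined as a non-writable property.
Object.defineProperty(attr, 'ksrc', {
  value: 'title',
  writable: false,
  enumerable: true
});

// Strict mode is enabled per function so the behaviour does not depend on
// how the surrounding script is loaded.
function tryAssignKsrc(obj, value) {
  'use strict';
  try {
    obj.ksrc = value;  // throws TypeError in strict mode
    return null;
  } catch (e) {
    return e;
  }
}

const caught = tryAssignKsrc(attr, 'TITLE');
console.log(caught instanceof TypeError, attr.ksrc);
```

In sloppy mode the same assignment fails silently, so a bug of this kind can stay hidden until the code starts running under `'use strict'`.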
Scott,
On 02/19/2013 03:50 PM, C. Scott Ananian wrote:
> http://localhost:8000/simple/Game gives:
> starting parsing of Game
> *********** ERROR: cs/s mismatch for node: A s: 3808; cs: 3821 ************
> completed parsing of Game in 1491 ms
This is really a warning about an inconsistency from the DSR (DOM source range) calculation algorithm. We should probably change the wording to make it less ominous.
> - [[File:]] tag parsing for images appears to be incomplete:
>   a) alt= and class= are not parsed
This is https://bugzilla.wikimedia.org/show_bug.cgi?id=45208.
>   b) 'thumb' and 'right' should result in <img class="thumb tright" /> or some such, but there doesn't appear to be an indication of either option in the parsoid output.
This is definitely a regression. Tracked in https://bugzilla.wikimedia.org/show_bug.cgi?id=45207.
> - I'd like to see title and revision information in the <head>
https://bugzilla.wikimedia.org/show_bug.cgi?id=45206
> - Interwiki links are not converted to relative links when the "interwiki" is actually the current wiki. (Maybe this isn't really a bug.)
This is https://bugzilla.wikimedia.org/show_bug.cgi?id=45209.
Cheers,
Gabriel
On 02/20/2013 03:04 PM, Gabriel Wicke wrote:
> Scott,
> On 02/19/2013 03:50 PM, C. Scott Ananian wrote:
>> http://localhost:8000/simple/Game gives:
>> starting parsing of Game
>> *********** ERROR: cs/s mismatch for node: A s: 3808; cs: 3821 ************
>> completed parsing of Game in 1491 ms
> This is really a warning about an inconsistency from the DSR (DOM source range) calculation algorithm. We should probably change the wording to make it less ominous.
Fixed in https://gerrit.wikimedia.org/r/#/c/50051/
Subbu.
wikitext-l@lists.wikimedia.org