Hello,
I use Parsoid to publish email messages to a wiki and have a small issue. Sometimes the generated article has "preformatted" fragments that have no special formatting in the source text. After investigating, I found that this is caused by spaces at the start of a new line in the HTML text. When the source HTML of the email is viewed in a browser these spaces have no effect, but after conversion to wikitext they become part of the markup. While trying to understand how Parsoid works, I saw that these spaces are normally surrounded with a <nowiki> tag, but in some circumstances that does not happen.
So I made a test HTML file to compare the conversion results:
<html> <head> </head> <body>
<p>test2<span> test3 </span></p>
<p><span>test2 test3 </span></p>
<p>textx<span>test2 test3 </span></p>
</body> </html>
The result of conversion is:
test2<span> test3 </span>
<span>test2 <nowiki> </nowiki>test3 </span>
textx<span>test2 <nowiki> </nowiki>test3 </span>
It seems that if the new line is right at the end of the <span> tag, <nowiki> is not inserted.
On Jul 22, 2019, at 5:11 AM, Sergey F sergey@fidoman.ru wrote:
> <p>test2<span> test3 </span></p>
> The result of conversion is:
> test2<span> test3
> </span>
Yes, this looks like a bug
See https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/524811
Thanks
On 7/22/19 10:51 AM, Arlo Breault wrote:
> On Jul 22, 2019, at 5:11 AM, Sergey F sergey@fidoman.ru wrote:
>> <p>test2<span> test3 </span></p>
>> The result of conversion is:
>> test2<span> test3
>> </span>
> Yes, this looks like a bug
> See https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/524811
> Thanks
Thanks Arlo!
Sergey:
It is possible that Arlo's bugfix will satisfy your use case.
However, note that Parsoid will introduce <nowiki> protection around characters that would parse differently if not escaped. So "<p> foo</p>" will convert to "<nowiki> </nowiki>foo". You can avoid this by passing the 'scrub_wikitext' flag to the html -> wikitext API endpoint [1]. This tells Parsoid to normalize [2] the input HTML to eliminate the need for those nowikis.
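As a rough illustration of passing that flag, here is a minimal Python sketch. The base URL, the "localhost" domain, the function names, and the JSON request body are all assumptions for illustration; the endpoint path follows my reading of the v3 API described at [1], so please check it against that page and your own Parsoid setup.

```python
import json
import urllib.request


def build_html2wt_request(html, base_url="http://localhost:8000", domain="localhost"):
    """Build (but do not send) an html -> wikitext request with scrub_wikitext set.

    base_url and domain are placeholders for a local Parsoid service;
    the path is assumed from the v3 transform API.
    """
    url = f"{base_url}/{domain}/v3/transform/html/to/wikitext"
    payload = {"html": html, "scrub_wikitext": True}
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def html_to_wikitext(html, **kwargs):
    """Send the request and return the response body as text."""
    req = build_html2wt_request(html, **kwargs)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling html_to_wikitext("<p> foo</p>") against a running service should then return wikitext without the <nowiki> wrapper, assuming the flag behaves as described above.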
FYI in case this flag is pertinent to your use case.
Subbu.
[1] https://www.mediawiki.org/wiki/Parsoid/API#For_HTML_-%3E_wikitext_requests
On 7/22/19 11:05 AM, Subramanya Sastry wrote:
> On 7/22/19 10:51 AM, Arlo Breault wrote:
>> On Jul 22, 2019, at 5:11 AM, Sergey F sergey@fidoman.ru wrote:
>>> <p>test2<span> test3 </span></p>
>>> The result of conversion is:
>>> test2<span> test3
>>> </span>
>> Yes, this looks like a bug
>> See https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/524811
>> Thanks
> Thanks Arlo!
> Sergey:
> It is possible that Arlo's bugfix will satisfy your use case.
It would have helped if I had actually seen Arlo's patch before I sent that email - he was fixing a case where we were not adding a nowiki where it should have been added.
So, you will need to pass the scrub_wikitext parameter if you want to avoid the nowikis. Or, you can normalize the HTML yourself before passing it to Parsoid.
Or, if you were just reporting the inconsistency, ignore my emails. :-)
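For the "normalize the HTML yourself" option, one crude sketch in Python is to drop the spaces and tabs that start each line before sending the HTML to Parsoid. The function name is hypothetical, and this is only an assumption about what is enough for the leading-space case reported in this thread: it does not parse the HTML, so it would also alter whitespace inside <pre> or <textarea> content, and I have not checked it against Parsoid's output.

```python
import re


def strip_leading_line_spaces(html):
    """Remove spaces and tabs at the start of every line of the HTML source.

    The line break itself is kept, so text outside <pre>/<textarea> should
    render the same in a browser. Hypothetical helper, not part of Parsoid.
    """
    return re.sub(r"(?m)^[ \t]+", "", html)
```

For example, strip_leading_line_spaces("<p><span>test2\n test3 </span></p>") gives "<p><span>test2\ntest3 </span></p>", so no line in the text starts with a space any more.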
Subbu.
wikitech-l@lists.wikimedia.org