On 04/02/2015 06:17 PM, Subramanya Sastry wrote:
Greetings!
Among other things, Parsoid converts HTML to wikitext. There are two requirements as far as this serialization / conversion goes:
- Preserving HTML -> HTML semantics (e.g. a list, when serialized to
wikitext and parsed back, should render as an identical list)
- Enforcing certain wikitext norms (e.g. use <nowiki/> to clarify
quote parsing instead of <nowiki>'</nowiki>; serialize categories on their own line, etc.)
In a large set of scenarios, there is no conflict between these two requirements, i.e. serializing in conformance with certain wikitext style norms also preserves the HTML -> HTML semantics.
But there are some scenarios where these requirements are in conflict. This conflict arises when Parsoid receives HTML that is malformed [1], or sends HTML that has no representation in wikitext [2], or sends HTML that will serialize to wikitext that editors will complain about [3]. We have adopted a somewhat ad hoc strategy so far, while favouring HTML -> HTML preservation.
Correction: s/sends HTML/receives HTML/g ... My concern in this email is entirely about HTML that Parsoid receives that it has to convert to wikitext.
Subbu.
In some scenarios like [1], it is easier to argue that Parsoid should break HTML -> HTML semantics by effectively blaming the breakage in HTML roundtripping on the client.
But it is a little less clear in scenarios like [3]. While ==<nowiki/>== is clearly valid wikitext and can be parsed back to preserve HTML semantics, there is a good argument to be made for dropping such empty headers in Parsoid as well. So, even though the WYSIWYG model breaks here, that could be seen as a less serious bug than outputting ==<nowiki/>==.
That said, we definitely do not want to get into the business of implementing ad hoc heuristics to work around client bugs (although we have implemented, and could continue to implement, temporary workarounds until client bugs are fixed). That is a slippery slope to code complexity.
So, two questions to answer here:
- Is there a set of wikitext norms that are applicable across wikis
and should be enforced as a syntactic style standard by Parsoid independent of clients? What are they, and can they be documented on a wiki page by the editor community? Should "Don't emit ==<nowiki/>==" be one of those?
- Independent of (1), should Parsoid implement an optional HTML
normalization pass that it applies on behalf of clients when the right API parameter is passed in? This might be useful in scenarios where it is simpler to fix bad HTML than prevent the generation of bad HTML.
In an ideal situation, if we can establish norms for (1), it will eliminate the need for (2) -- (2) is less desirable and is mostly a fallback beyond (1).
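To make (2) concrete, here is a minimal sketch of what an opt-in normalization pass could look like for the empty-heading case in [3]. The function name `normalize_html`, the `drop_empty_headings` flag, and the regex-based matching are all assumptions made for illustration; none of this is Parsoid's actual API, and a real implementation would more likely walk a DOM than pattern-match on strings.

```python
import re

# Hypothetical pattern: an <h1>..<h6> element whose content is only
# whitespace. The backreference \1 requires the closing tag to use the
# same heading level as the opening tag.
_EMPTY_HEADING = re.compile(r"<h([1-6])(?:\s[^>]*)?>\s*</h\1>", re.IGNORECASE)


def normalize_html(html: str, drop_empty_headings: bool = False) -> str:
    """Optionally strip empty headings before serialization to wikitext.

    The pass is off by default, so clients must opt in explicitly
    (standing in for "the right API parameter" described above).
    """
    if not drop_empty_headings:
        return html
    return _EMPTY_HEADING.sub("", html)
```

With the flag off, the HTML passes through untouched and HTML -> HTML semantics are preserved; with it on, an empty heading is removed so that the serializer never has to emit ==<nowiki/>== for it.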
Thanks, Subbu.
[1] T94599: https://phabricator.wikimedia.org/T94599
[2] Example: <a data-my-attribute="foo" href="http://google.com">foo</a> or see T94766: https://phabricator.wikimedia.org/T94766
[3] T94867: https://phabricator.wikimedia.org/T94867