Greetings!
Among other things, Parsoid converts HTML to wikitext. There are two
requirements as far as this serialization / conversion goes:
* Preserving HTML -> HTML semantics (e.g. a list, when serialized to
wikitext and parsed back, should render as an identical list)
* Enforcing certain wikitext norms (e.g. use <nowiki/> to clarify quote
parsing instead of <nowiki>'</nowiki>; serialize categories on their own
line, etc.)
In a large set of scenarios, there is no conflict between these two
requirements: serializing to conform to certain wikitext style norms
also preserves the HTML -> HTML semantics.
But there are some scenarios where these requirements are in conflict.
This conflict arises when Parsoid receives HTML that is malformed [1],
that has no representation in wikitext [2], or that will serialize to
wikitext that editors will complain about [3]. So far we have handled
these cases in a somewhat ad hoc way, while generally favouring HTML ->
HTML preservation.
In some scenarios like [1], it is easier to argue that Parsoid should
break HTML -> HTML semantics by effectively blaming the breakage in HTML
roundtripping on the client.
But it is a little less clear in scenarios like [3]. Although
==<nowiki/>== is clearly valid wikitext and can be parsed back to
preserve HTML semantics, there is a good argument to be made for also
dropping such empty headers in Parsoid. So, even though the WYSIWYG
model breaks here, that could be seen as a less serious bug than
outputting ==<nowiki/>==.
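To make the trade-off in [3] concrete, here is a toy sketch of the two
possible policies for an empty heading. It is not Parsoid's serializer;
the function name and the drop_empty flag are hypothetical and exist
only for illustration.

```python
def serialize_heading(level, text, drop_empty=False):
    """Toy heading serializer (hypothetical, not Parsoid code)."""
    marker = "=" * level
    content = text.strip()
    if content:
        # Normal case: both policies agree.
        return f"{marker} {content} {marker}"
    if drop_empty:
        # Norm-enforcing policy: emit nothing, which breaks the
        # HTML -> HTML roundtrip because the heading disappears.
        return ""
    # Preserving policy: valid wikitext that parses back to an empty
    # heading, but that editors are likely to complain about.
    return f"{marker}<nowiki/>{marker}"


print(repr(serialize_heading(2, "")))
print(repr(serialize_heading(2, "", drop_empty=True)))
```

The first call keeps the roundtrip intact at the cost of unusual
wikitext; the second gives clean wikitext at the cost of the roundtrip.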
That said, we definitely do not want to get into the business of
implementing ad hoc heuristics to work around client bugs (although we
have implemented, and could continue to implement, temporary workarounds
until client bugs are fixed). That is a slippery slope to code
complexity.
So, two questions to answer here:
1. Is there a set of wikitext norms that is applicable across wikis
and should be enforced as a syntactic style standard by Parsoid
independent of clients? What are they, and can they be documented on a
wiki page by the editor community? Would 'Don't emit "==<nowiki/>=="'
be one of those?
2. Independent of (1), should Parsoid implement an optional HTML
normalization pass that it applies on behalf of clients when the right
API parameter is passed in? This might be useful in scenarios where it
is simpler to fix bad HTML than to prevent the generation of bad HTML.
In an ideal situation, if we can establish norms for (1), it will
eliminate the need for (2) -- (2) is less desirable and is mostly a
fallback beyond (1).
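As a rough illustration of what an opt-in pass like (2) could look like,
here is a minimal sketch that drops headings with no text content before
serialization. It uses Python's standard html.parser rather than
anything in Parsoid; the class name, the normalize() function, and the
drop_empty_headings flag are all assumptions made up for this example,
and "empty" is assumed to mean "no non-whitespace text".

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class EmptyHeadingStripper(HTMLParser):
    """Re-emits the input HTML, omitting headings that contain no text."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []            # markup that will be kept
        self.pending = None      # buffered markup of the open heading
        self.pending_tag = None  # tag name of the open heading
        self.has_text = False    # saw non-whitespace text in the heading

    def _emit(self, markup):
        target = self.pending if self.pending is not None else self.out
        target.append(markup)

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS and self.pending is None:
            # Start buffering until we know whether the heading is empty.
            self.pending, self.pending_tag = [], tag
            self.has_text = False
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if self.pending is not None and tag == self.pending_tag:
            buffered, self.pending, self.pending_tag = self.pending, None, None
            if self.has_text:
                self.out.extend(buffered)  # keep non-empty headings

    def handle_data(self, data):
        if data.strip():
            self.has_text = True
        self._emit(data)

    def handle_entityref(self, name):
        self.has_text = True
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self.has_text = True
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")


def normalize(html, drop_empty_headings=False):
    """Hypothetical opt-in pass: a no-op unless the flag is set."""
    if not drop_empty_headings:
        return html
    parser = EmptyHeadingStripper()
    parser.feed(html)
    parser.close()
    if parser.pending is not None:
        # Unclosed heading at end of input: keep it rather than guess.
        parser.out.extend(parser.pending)
    return "".join(parser.out)


print(normalize("<p>a</p><h2></h2><h2>B</h2>", drop_empty_headings=True))
```

Keeping the pass off by default mirrors the "right API parameter" idea:
clients that already produce clean HTML see no change in behaviour.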
Thanks,
Subbu.
[1] T94599: https://phabricator.wikimedia.org/T94599
[2] Example: <a data-my-attribute="foo" href="http://google.com">foo</a>
    or see T94766: https://phabricator.wikimedia.org/T94766
[3] T94867: https://phabricator.wikimedia.org/T94867