Hi all!
During the hackathon, I worked on a patch that would make it possible for non-textual content to be included on wikitext pages using the template syntax. The idea is that if we have a content handler that e.g. generates awesome diagrams from JSON data, like the extension Dan Andreescu wrote, we want to be able to use that output on a wiki page. But until now, that would have required the content handler to generate wikitext for the transclusion - not easily done.
So, I came up with a way for ContentHandler to wrap the HTML generated by another ContentHandler so it can be used for transclusion.
Have a look at the patch at https://gerrit.wikimedia.org/r/#/c/132710/. Note that I have completely rewritten it since my first version at the hackathon.
It would be great to get some feedback on this, and have it merged soon, so we can start using non-textual content to its full potential.
Here is a quick overview of the information flow. Let's assume we have a "template" page T that is supposed to be transcluded on a "target" page P; the template page uses the non-text content model X, while the target page is wikitext. So:
* When Parser parses P, it encounters {{T}}
* Parser loads the Content object for T (an XContent object, for model X), and calls getTextForTransclusion() on it, with CONTENT_MODEL_WIKITEXT as the target format.
* getTextForTransclusion() calls getContentForTransclusion()
* getContentForTransclusion() calls convert( CONTENT_MODEL_WIKITEXT ), which fails (because content model X doesn't provide a wikitext representation).
* getContentForTransclusion() then calls convertContentViaHtml()
* convertContentViaHtml() calls getTextForTransclusion( CONTENT_MODEL_HTML ) to get the HTML representation.
* getTextForTransclusion() calls getContentForTransclusion(), which calls convert(), which handles the conversion to HTML by calling getHtml() directly.
* convertContentViaHtml() takes the HTML and calls makeContentFromHtml() on the ContentHandler for wikitext.
* makeContentFromHtml() replaces the actual HTML with a parser strip mark, and returns a WikitextContent containing this strip mark.
* The strip mark is eventually returned to the original Parser instance, and used to replace {{T}} on the original page.
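To make the flow above concrete, here is a rough Python sketch of the fallback chain. The method names mirror the patch, but the XContent class, the method bodies, and the strip-mark format are illustrative assumptions, not the actual PHP code:

```python
# Illustrative sketch of the conversion chain, not the actual PHP patch.
CONTENT_MODEL_WIKITEXT = "wikitext"
CONTENT_MODEL_HTML = "html"

STRIP_MARKS = {}  # stand-in for the parser's strip-mark table

def make_wikitext_from_html(html):
    # makeContentFromHtml(): replace the raw HTML with a strip mark;
    # the parser substitutes the HTML back in after wikitext processing.
    mark = "\x7fUNIQ-html-%d\x7f" % len(STRIP_MARKS)
    STRIP_MARKS[mark] = html
    return mark

class XContent:
    """Content of some non-text model X that only renders to HTML."""

    def __init__(self, data):
        self.data = data

    def get_html(self):
        # getHtml(): imagine a diagram renderer here.
        return "<div class='diagram'>%s</div>" % self.data

    def convert(self, target_model):
        # convert(): only an HTML representation is available.
        if target_model == CONTENT_MODEL_HTML:
            return self.get_html()
        return None  # no wikitext representation -> conversion fails

    def get_text_for_transclusion(self, target_model):
        # getTextForTransclusion(): try a direct conversion first ...
        direct = self.convert(target_model)
        if direct is not None:
            return direct
        # ... then fall back (convertContentViaHtml()): render to HTML
        # and let the target model's handler wrap it.
        html = self.convert(CONTENT_MODEL_HTML)
        return make_wikitext_from_html(html)
```

In this sketch, asking the content for a wikitext transclusion yields a strip mark, which the parser later replaces with the rendered HTML.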
This essentially means that any content can be converted to HTML, and can be transcluded into any content that provides an implementation of makeContentFromHtml(). This actually changes how transclusion of JS and CSS pages into wikitext pages works. You can try this out by transcluding a JS page like MediaWiki:Test.js as a template on a wikitext page.
The old getWikitextForTransclusion() is now a shorthand for getTextForTransclusion( CONTENT_MODEL_WIKITEXT ).
As Brion pointed out in a comment to my original, there is another caveat: what should the expandtemplates module do when expanding non-wikitext templates? I decided to just wrap the HTML in <html>...</html> tags instead of using a strip mark in this case. The resulting wikitext is however only "correct" if $wgRawHtml is enabled, otherwise, the HTML will get mangled/escaped by wikitext parsing. This seems acceptable to me, but please let me know if you have a better idea.
So, let me know what you think! Daniel
On Tue, May 13, 2014 at 11:37 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
As Brion pointed out in a comment to my original, there is another caveat: what should the expandtemplates module do when expanding non-wikitext templates? I decided to just wrap the HTML in <html>...</html> tags instead of using a strip mark in this case. The resulting wikitext is however only "correct" if $wgRawHtml is enabled, otherwise, the HTML will get mangled/escaped by wikitext parsing. This seems acceptable to me, but please let me know if you have a better idea.
Just brainstorming:
To avoid the wikitext mangling, you could wrap it in some tag that works like <html> if $wgRawHtml is set and <pre> otherwise.
Or one step further, maybe a tag <foo wikitext="{{P}}">html goes here</foo> that parses just as {{P}} does (and ignores "html goes here" entirely), which preserves the property that the output of expandtemplates will mostly work when passed back to the parser.
Thanks all for the input!
On 14.05.2014 10:17, Gabriel Wicke wrote:
On 05/13/2014 05:37 PM, Daniel Kinzler wrote:
It sounds like this won't work well with current Parsoid. We are using action=expandtemplates for the preprocessing of transclusions, and then parse the contents using Parsoid. The content is finally passed through the sanitizer to keep XSS at bay.
This means that HTML returned from the preprocessor needs to be valid in wikitext to avoid being stripped out by the sanitizer. Maybe that's actually possible, but my impression is that you are shooting for something that's closer to the behavior of a tag extension. Those already bypass the sanitizer, so would be less troublesome in the short term.
Yes. Just treat <html>...</html> like a tag extension, and it should work fine. Do you see any problems with that?
So it is important to think of renderers as services, so that they are usable from the content API and Parsoid. For existing PHP code this could even be action=parse, but for new renderers without a need or desire to tie themselves to MediaWiki internals I'd recommend to think of them as their own service. This can also make them more attractive to third party contributors from outside the MediaWiki world, as has for example recently happened with Mathoid.
True, but that has little to do with my patch. It just means that 3rd party Content objects should preferably implement getHtml() by calling out to a service object.
On 13.05.2014 21:38, Brad Jorsch (Anomie) wrote:
To avoid the wikitext mangling, you could wrap it in some tag that works like <html> if $wgRawHtml is set and <pre> otherwise.
But <pre> will result in *escaped* HTML. That's just another kind of mangling. It's not at all the "normal" result of parsing.
Basically, the <html> mode is for expandtemplates only, and not intended to be followed up by "actual" parsing.
On 13.05.2014 21:38, Brad Jorsch (Anomie) wrote:
Or one step further, maybe a tag <foo wikitext="{{P}}">html goes here</foo> that parses just as {{P}} does (and ignores "html goes here" entirely), which preserves the property that the output of expandtemplates will mostly work when passed back to the parser.
Hm... that's an interesting idea, I'll think about it!
Btw, just so this is mentioned somewhere: it would be very easy to simply not expand such templates at all in expandtemplates mode, keeping them as {{T}} or [[T]].
On 14.05.2014 00:11, Matthew Flaschen wrote:
From working with Dan on this, the main issue is the ResourceLoader module that the diagrams require (it uses a JavaScript library called Vega, plus a couple supporting libraries, and simple MW setup code).
The container element that it needs can be as simple as:
<div data-something="..."></div>
which is actually valid wikitext.
So, there is no server side rendering at all? It's all done using JS on the client? Ok then, HTML transclusion isn't the solution.
Can you outline how RL modules would be handled in the transclusion scenario?
The current patch does not really address that problem, I'm afraid. I can think of two solutions:
* Create a SyntheticHtmlContent class that would hold meta info about modules etc., just like ParserOutput - perhaps it would just contain a ParserOutput object. And an equivalent SyntheticWikitextContent class, perhaps. That would allow us to pass such meta-info around as needed.
* Move the entire logic for HTML based transclusion into the wikitext parser, where it can just call getParserOutput() on the respective Content object. We would then no longer need the generic infrastructure for HTML transclusion. Maybe that would be a better solution in the end.
Hm... yes, I should make an alternative patch using that approach, so we can compare.
Thanks for your input! -- daniel
On 05/14/2014 01:40 PM, Daniel Kinzler wrote:
This means that HTML returned from the preprocessor needs to be valid in wikitext to avoid being stripped out by the sanitizer. Maybe that's actually possible, but my impression is that you are shooting for something that's closer to the behavior of a tag extension. Those already bypass the sanitizer, so would be less troublesome in the short term.
Yes. Just treat <html>...</html> like a tag extension, and it should work fine. Do you see any problems with that?
First of all you'll have to make sure that users cannot inject <html> tags as that would enable arbitrary XSS. I might have missed it, but I believe that this is not yet done in your current patch.
In contrast to normal tag extensions <html> would also contain fully rendered HTML, and should not be piped through action=parse as is done in Parsoid for tag extensions (in absence of a direct tag extension expansion API end point). We and other users of the expandtemplates API will have to add special-case handling for this pseudo tag extension.
In HTML, the <html> tag is also not meant to be used inside the body of a page. I'd suggest using a different tag name to avoid issues with HTML parsers and potential name conflicts with existing tag extensions.
Overall it does not feel like a very clean way to do this. My preference would be to let the consumer directly ask for pre-expanded wikitext *or* HTML, without overloading action=expandtemplates. Even indicating the content type explicitly in the API response (rather than inline with an HTML tag) would be a better stop-gap as it would avoid some of the security and compatibility issues described above.
So it is important to think of renderers as services, so that they are usable from the content API and Parsoid. For existing PHP code this could even be action=parse, but for new renderers without a need or desire to tie themselves to MediaWiki internals I'd recommend to think of them as their own service. This can also make them more attractive to third party contributors from outside the MediaWiki world, as has for example recently happened with Mathoid.
True, but that has little to do with my patch. It just means that 3rd party Content objects should preferably implement getHtml() by calling out to a service object.
You are right that it is not an immediate issue with your patch. The point is about the *longer-term* role of the ContentHandler vs. the content API. The ContentHandler could either try to be the central piece of our new content API, or could become an integration point that normally calls out to the content API and other services to retrieve HTML.
To me the latter is preferable as it enables us to optimize the content API for high request rates by concentrating on doing one job well, and lets us leverage this API from the server-side MediaWiki front-end through ContentHandler.
Gabriel
On 14.05.2014 15:11, Gabriel Wicke wrote:
On 05/14/2014 01:40 PM, Daniel Kinzler wrote:
This means that HTML returned from the preprocessor needs to be valid in wikitext to avoid being stripped out by the sanitizer. Maybe that's actually possible, but my impression is that you are shooting for something that's closer to the behavior of a tag extension. Those already bypass the sanitizer, so would be less troublesome in the short term.
Yes. Just treat <html>...</html> like a tag extension, and it should work fine. Do you see any problems with that?
First of all you'll have to make sure that users cannot inject <html> tags as that would enable arbitrary XSS. I might have missed it, but I believe that this is not yet done in your current patch.
My patch doesn't change the handling of <html>...</html> by the parser. As before, the parser will pass HTML code in <html>...</html> through only if wgRawHtml is enabled, and will mangle/sanitize it otherwise.
My patch does mean, however, that the text returned by expandtemplates may not render as expected when processed by the parser. Perhaps anomie's approach of preserving the original template call would work, something like:
<html template="{{T}}">...</html>
Then, the parser could apply the normal expansion when encountering the tag, ignoring the pre-rendered HTML.
In contrast to normal tag extensions <html> would also contain fully rendered HTML, and should not be piped through action=parse as is done in Parsoid for tag extensions (in absence of a direct tag extension expansion API end point). We and other users of the expandtemplates API will have to add special-case handling for this pseudo tag extension.
Handling for the <html> tag should already be in place, since it's part of the core spec. The issue is only to know when to allow/trust such <html> tags, and when to treat them as plain text (or like a <pre> tag).
In HTML, the <html> tag is also not meant to be used inside the body of a page. I'd suggest using a different tag name to avoid issues with HTML parsers and potential name conflicts with existing tag extensions.
As above: <html> is part of the core syntax, to support $wgRawHtml. It's just disabled per default.
Overall it does not feel like a very clean way to do this. My preference would be to let the consumer directly ask for pre-expanded wikitext *or* HTML, without overloading action=expandtemplates.
The question is how to represent non-wikitext transclusions in the output of expandtemplates. We'll need an answer to this question in any case.
For the main purpose of my patch, expandtemplates is irrelevant. I added the special mode that generates <html> specifically to have a consistent wikitext representation for use by expandtemplates. I could just as well simply disable it, so no expansion would apply for such templates when calling expandtemplates (as is done for special page inclusion).
Even indicating the content type explicitly in the API response (rather than inline with an HTML tag) would be a better stop-gap as it would avoid some of the security and compatibility issues described above.
The content type did not change. It's wikitext.
-- daniel
On 05/14/2014 03:22 PM, Daniel Kinzler wrote:
My patch doesn't change the handling of <html>...</html> by the parser. As before, the parser will pass HTML code in <html>...</html> through only if wgRawHtml is enabled, and will mangle/sanitize it otherwise.
Oh, I thought that you wanted to support normal wikis with $wgRawHtml disabled.
The content type did not change. It's wikitext.
Anything is wikitext ;)
Gabriel
On 14.05.2014 16:04, Gabriel Wicke wrote:
On 05/14/2014 03:22 PM, Daniel Kinzler wrote:
My patch doesn't change the handling of <html>...</html> by the parser. As before, the parser will pass HTML code in <html>...</html> through only if wgRawHtml is enabled, and will mangle/sanitize it otherwise.
Oh, I thought that you wanted to support normal wikis with $wgRawHtml disabled.
I want to, and I do. <html> is not used for normal rendering; it is used by expandtemplates only. During normal rendering, a strip mark is inserted, which will work on all wikis. The one thing that will not work on wikis with $wgRawHtml disabled is parsing the output of expandtemplates.
-- daniel
On 05/15/2014 04:42 PM, Daniel Kinzler wrote:
The one thing that will not work on wikis with $wgRawHtml disabled is parsing the output of expandtemplates.
Yes, which means that it won't work with Parsoid, Flow, VE and other users.
I do think that we can do better, and I pointed out possible ways to do so in my earlier mail:
My preference would be to let the consumer directly ask for pre-expanded wikitext *or* HTML, without overloading action=expandtemplates. Even indicating the content type explicitly in the API response (rather than inline with an HTML tag) would be a better stop-gap as it would avoid some of the security and compatibility issues described above.
Gabriel
On 16.05.2014 21:07, Gabriel Wicke wrote:
On 05/15/2014 04:42 PM, Daniel Kinzler wrote:
The one thing that will not work on wikis with $wgRawHtml disabled is parsing the output of expandtemplates.
Yes, which means that it won't work with Parsoid, Flow, VE and other users.
And it has been fixed now. In the latest version, expandtemplates will just return {{Foo}} as it was if {{Foo}} can't be expanded to wikitext.
I do think that we can do better, and I pointed out possible ways to do so in my earlier mail:
My preference would be to let the consumer directly ask for pre-expanded wikitext *or* HTML, without overloading action=expandtemplates. Even indicating the content type explicitly in the API response (rather than inline with an HTML tag) would be a better stop-gap as it would avoid some of the security and compatibility issues described above.
I don't quite understand what you are asking for... action=parse returns HTML, action=expandtemplates returns wikitext. The issue was with "mixed" output, that is, representing the expansion of templates that generate HTML in wikitext. The solution I'm going for now is to simply not expand them.
-- daniel
(Top posting to quickly summarize what I gathered from the discussion and what would be required for Parsoid to expand pages with these transclusions).
Parsoid currently relies on the MediaWiki API to preprocess transclusions and return wikitext (it uses action=expandtemplates for this), which it then parses using the native Parsoid pipeline. Parsoid processes extension tags via action=parse and weaves the result back into the top-level content of the page.
As per your original email, I am assuming that T is a page with a special content model that generates HTML, and that another page P has a transclusion {{T}}.
So, when Parsoid encounters {{T}}, it should be able to replace {{T}} with the HTML to generate the right parse output for P.
So, I am listing below 4 possible ways action=expandtemplates can process {{T}}:
1. Your newest implementation (which just returns {{T}} unchanged):
* If Parsoid gets back {{T}}, one of two things can happen:
--- Parsoid, as usual, tries to parse it as wikitext, and gets stuck in an infinite loop (query the MW API for the expansion of {{T}}, get back {{T}}, parse it as {{T}}, query the MW API for the expansion of {{T}}, ...). So, this will definitely not work.
--- Parsoid adds a special-case check to see if the API sent back {{T}}, in which case it requires a different API endpoint (action=expandtohtml maybe?) to send back the HTML expansion, based on the assumption about the output of expandtemplates. This would work, and would require the new endpoint to be implemented, but feels hacky.
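The infinite-loop hazard in option 1 comes down to a missing fixed-point check on the client side. A minimal sketch (api_expandtemplates and the retry logic here are hypothetical stand-ins, not actual Parsoid code):

```python
def expand_transclusion(src, api_expandtemplates, max_depth=10):
    """Illustrative only: why option 1 needs a fixed-point check.

    api_expandtemplates stands in for an action=expandtemplates call.
    If it hands {{T}} back unchanged, naively re-querying loops forever;
    comparing the result to the input detects the fixed point so the
    client can fall back to some other endpoint instead.
    """
    for _ in range(max_depth):
        expanded = api_expandtemplates(src)
        if expanded == src:
            # Fixed point: the API declined to expand. Special-case it
            # (e.g. ask a hypothetical action=expandtohtml for the HTML).
            return src
        src = expanded
    raise RuntimeError("expansion did not converge")
```

Without the equality check, an API that echoes {{T}} back makes the client loop forever, which is the first failure mode described above.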
So, going back to your original implementation, here are at least 3 ways I see this working:
2. action=expandtemplates returns <html>...</html> for the expansion of {{T}}, but also provides an additional API response header that tells Parsoid that T was a special content model page and that the raw HTML it received should not be sanitized.
3. action=expandtemplates returns <html>...</html> for the expansion of {{T}}, and no other indication of T being a special content model page. However, if Parsoid (and other clients) are to always trust this HTML output without sanitization, the expandtemplates implementation should conditionally sanitize <html> tags encountered in wikitext to prevent XSS. As far as I understand, expandtemplates (on master, not your patch) does not do this tag sanitization. But, independent of that, what Parsoid and other clients need is a guarantee that it is safe to blindly splice in the contents of any <html>...</html> they receive for any {{T}}, no matter what content model T implements.
4. Parsoid first queries the MW-api to find out the content model of T for every transclusion {{T}} it encounters on the page P and based on the content-model info, knows how to process the output of action=expandtemplates.
Clearly 4. is expensive and 3. seems hacky, but if it can be made to work, we can work with that.
But, both Gabriel and I think that solution 2. is the cleanest workable solution for now. The PHP parser (in your patch to handle {{T}}) already has information about the content model of T when it is expanding {{T}}, and it seems simplest and cleanest to return this information back to clients for the non-default content-model expansions. That gives clients like Parsoid the cleanest way of handling these.
If I am missing something, or this is unclear and getting into too much back and forth on email, and it is simpler to discuss this on IRC, I can hop onto any IRC channel on Monday - or we can do this on #mediawiki-parsoid, and one of us could later summarize the discussion back onto this thread.
Thanks, Subbu.
On 05/17/2014 02:54 AM, Daniel Kinzler wrote:
On 16.05.2014 21:07, Gabriel Wicke wrote:
On 05/15/2014 04:42 PM, Daniel Kinzler wrote:
The one thing that will not work on wikis with $wgRawHtml disabled is parsing the output of expandtemplates.
Yes, which means that it won't work with Parsoid, Flow, VE and other users.
And it has been fixed now. In the latest version, expandtemplates will just return {{Foo}} as it was if {{Foo}} can't be expanded to wikitext.
I do think that we can do better, and I pointed out possible ways to do so in my earlier mail:
My preference would be to let the consumer directly ask for pre-expanded wikitext *or* HTML, without overloading action=expandtemplates. Even indicating the content type explicitly in the API response (rather than inline with an HTML tag) would be a better stop-gap as it would avoid some of the security and compatibility issues described above.
I don't quite understand what you are asking for... action=parse returns HTML, action=expandtemplates returns wikitext. The issue was with "mixed" output, that is, representing the expansion of templates that generate HTML in wikitext. The solution I'm going for now is to simply not expand them.
-- daniel
On 05/17/2014 10:51 AM, Subramanya Sastry wrote:
So, going back to your original implementation, here are at least 3 ways I see this working:
- action=expandtemplates returns a <html>...</html> for the expansion
of {{T}}, but also provides an additional API response header that tells Parsoid that T was a special content model page and that the raw HTML that it received should not be sanitized.
Actually, the <html></html> wrapper is not even required here since the new API response header (for example, X-Content-Model: HTML) is sufficient to know what to do with the response body.
Subbu.
On 05/17/2014 05:57 PM, Subramanya Sastry wrote:
On 05/17/2014 10:51 AM, Subramanya Sastry wrote:
So, going back to your original implementation, here are at least 3 ways I see this working:
- action=expandtemplates returns a <html>...</html> for the expansion of
{{T}}, but also provides an additional API response header that tells Parsoid that T was a special content model page and that the raw HTML that it received should not be sanitized.
Actually, the <html></html> wrapper is not even required here since the new API response header (for example, X-Content-Model: HTML) is sufficient to know what to do with the response body.
Indeed.
Also, instead of the header we can just set a property / attribute in the JSON/XML response structure. This will also work for multi-part responses, for example when calling action=expandtemplates on multiple titles.
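To illustrate, client-side dispatch on such a response property could look roughly like this. The contentmodel field name and the response shape are invented for illustration; the real API would define its own:

```python
def handle_expansion(response):
    # Dispatch on the content model declared in the response structure
    # (field names here are hypothetical, not the real API's).
    result = response["expandtemplates"]
    body = result["wikitext"]
    if result.get("contentmodel") == "html":
        return inject_html(body)   # trusted, pre-rendered HTML
    return parse_wikitext(body)    # normal wikitext pipeline

def inject_html(html):
    # Stand-in for splicing HTML into the document without sanitization.
    return ("html", html)

def parse_wikitext(text):
    # Stand-in for the normal wikitext parse.
    return ("wikitext", text)
```

A property like this would also extend naturally to multi-part responses, since each part can carry its own declared model.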
Gabriel
On 17.05.2014 17:57, Subramanya Sastry wrote:
On 05/17/2014 10:51 AM, Subramanya Sastry wrote:
So, going back to your original implementation, here are at least 3 ways I see this working:
- action=expandtemplates returns a <html>...</html> for the expansion of
{{T}}, but also provides an additional API response header that tells Parsoid that T was a special content model page and that the raw HTML that it received should not be sanitized.
Actually, the <html></html> wrapper is not even required here since the new API response header (for example, X-Content-Model: HTML) is sufficient to know what to do with the response body.
But that would only work if {{T}} was the whole text that was being expanded (I guess that's what you do with parsoid, right? Took me a minute to realize that). expandtemplates operates on full wikitext. If the input is something like
== Foo ==
{{T}}
[[Category:Bla]]
Then expanding {{T}} without a wrapper and pretending the result was HTML would just be wrong.
Regarding trusting the output: MediaWiki core trusts the generated HTML for direct output. It's no different from the HTML generated by e.g. special pages in that regard.
I think something like <html transclusion="{{T}}" model="whatever">...</html> would work best.
-- daniel
On 05/17/2014 06:14 PM, Daniel Kinzler wrote:
On 17.05.2014 17:57, Subramanya Sastry wrote:
On 05/17/2014 10:51 AM, Subramanya Sastry wrote:
So, going back to your original implementation, here are at least 3 ways I see this working:
- action=expandtemplates returns a <html>...</html> for the expansion of
{{T}}, but also provides an additional API response header that tells Parsoid that T was a special content model page and that the raw HTML that it received should not be sanitized.
Actually, the <html></html> wrapper is not even required here since the new API response header (for example, X-Content-Model: HTML) is sufficient to know what to do with the response body.
But that would only work if {{T}} was the whole text that was being expanded (I guess that's what you do with parsoid, right? Took me a minute to realize that). expandtemplates operates on full wikitext. If the input is something like
== Foo ==
{{T}}
[[Category:Bla]]
Then expanding {{T}} without a wrapper and pretending the result was HTML would just be wrong.
Parsoid handles this correctly. We have mechanisms for injecting HTML as well as wikitext into the toplevel page. For example, tag extensions currently return fully expanded html (we use action=parse API endpoint) and we inject that HTML into the page. So, consider this wikitext for page P.
== Foo ==
{{wikitext-transclusion}}
*a1
<map ..> ... </map>
*a2
{{T}} (the html-content-model-transclusion)
*a3
Parsoid gets wikitext from the API for {{wikitext-transclusion}}, parses it and injects the tokens into P's content. Parsoid gets HTML from the API for <map..>...</map> and injects the HTML into the not-fully-processed wikitext of P (by adding an appropriate token wrapper). So, if {{T}} returns HTML (i.e. the MW API lets Parsoid know that it is HTML), Parsoid can inject the HTML into the not-fully-processed wikitext and ensure that the final output comes out right (in this case, the HTML from both the map extension and {{T}} would not get sanitized, as intended).
Does that help explain why we said we don't need the html wrapper?
All that said, if you want to provide the wrapper with <html model="whatever" ....>fully-expanded-HTML</html>, we can handle that as well. We'll use the model attribute of the wrapper, discard the wrapper and use the contents in our pipeline.
So, model information either as an attribute on the wrapper, an API response header, or a property in the JSON/XML response structure would all work for us. I don't have clarity on which of these three is the best mechanism for providing the template-page content-model information to clients, so until I understand that better, I don't have an opinion about the specific mechanism. However, in his previous message, Gabriel indicated that a property in the JSON/XML response structure might work better for multi-part responses.
Subbu.
On 05/18/2014 02:28 AM, Subramanya Sastry wrote:
However, in his previous message, Gabriel indicated that a property in the JSON/XML response structure might work better for multi-part responses.
The difference between wrapper and property is actually that using inline wrappers in the returned wikitext would force us to escape similar wrappers from normal template content to avoid opening a gaping XSS hole.
A separate property in the JSON/XML structure avoids the need for escaping (and associated security risks if not done thoroughly), and should be relatively straightforward to implement and consume.
Gabriel
I'm getting the impression there is a fundamental misunderstanding here.
On 18.05.2014 04:28, Subramanya Sastry wrote:
So, consider this wikitext for page P.
== Foo ==
{{wikitext-transclusion}}
*a1
<map ..> ... </map>
*a2
{{T}} (the html-content-model-transclusion)
*a3
Parsoid gets wikitext from the API for {{wikitext-transclusion}}, parses it and injects the tokens into P's content. Parsoid gets HTML from the API for <map..>...</map> and injects the HTML into the not-fully-processed wikitext of P (by adding an appropriate token wrapper). So, if {{T}} returns HTML (i.e. the MW API lets Parsoid know that it is HTML), Parsoid can inject the HTML into the not-fully-processed wikitext and ensure that the final output comes out right (in this case, the HTML from both the map extension and {{T}} would not get sanitized, as intended).
Does that help explain why we said we don't need the html wrapper?
No, it actually misses my point completely. My point is that this may work with the way Parsoid uses expandtemplates, but it does not work for expandtemplates in general, because expandtemplates takes full wikitext as input, and only partially replaces it.
So, let me phrase it this way:
If expandtemplates is called with text=
== Foo ==
{{T}}
[[Category:Bla]]
What should it return, and what content type should be declared in the http header?
Note that I'm not talking about how parsoid processes this text. That's not my point - my point is that expandtemplates can be and is used on full wikitext. In that context, the return type cannot be HTML.
All that said, if you want to provide the wrapper with <html model="whatever" ....>fully-expanded-HTML</html>, we can handle that as well. We'll use the model attribute of the wrapper, discard the wrapper and use the contents in our pipeline.
Why use the model attribute? Why would you care about the original model? All you need to know is that you'll get HTML. Exposing the original model in this context seems useless if not misleading. <html transclude="{{T}}"></html> would give the backend parser a way to discard the HTML (as unsafe) and execute the transclusion instead (generating trusted HTML). In fact, we could just omit the content of the <html> tag.
So, model information either as an attribute on the wrapper, api response header, or a property in the JSON/XML response structure would all work for us.
As explained above, the return type cannot be HTML for the full text, because any "plain" wikitext would stay unprocessed. There needs to be a marker for "html transclusion *here*" in the text.
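For illustration, mixed output with such an inline marker might look like this. A toy sketch only: the marker syntax is one of the proposals from this thread, and the regex-based expansion is a stand-in for the real preprocessor:

```python
import re

def expand_mixed(src, models):
    # Toy expandtemplates over full wikitext: HTML-model transclusions
    # become an inline marker, plain wikitext templates expand normally.
    # `models` maps template names to their content model (hypothetical).
    def repl(match):
        name = match.group(1)
        if models.get(name) == "html":
            # Marker tells a trusted consumer to expand {{T}} itself.
            return '<html transclude="{{%s}}"></html>' % name
        return "expanded:" + name  # stand-in for wikitext expansion
    return re.sub(r"\{\{([^}|]+)\}\}", repl, src)
```

The surrounding wikitext (headings, category links) stays wikitext; only the spot where the HTML transclusion occurred carries the marker.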
On 18.05.2014 16:29, Gabriel Wicke wrote:
The difference between wrapper and property is actually that using inline wrappers in the returned wikitext would force us to escape similar wrappers from normal template content to avoid opening a gaping XSS hole.
Please explain, I do not see the hole you mention.
If the input contained <html>evil stuff</html>, it would just get escaped by the preprocessor (unless $wgRawHtml is enabled), as it is now: https://de.wikipedia.org/w/api.php?action=expandtemplates&text=%3Chtml%3...
If <html transclude="{{T}}"> was passed, the parser/preprocessor would treat it like it would treat {{T}} - it would get trusted, backend-generated HTML from the respective Content object.
I see no change, and no opportunity to inject anything. Am I missing something?
A separate property in the JSON/XML structure avoids the need for escaping (and associated security risks if not done thoroughly), and should be relatively straightforward to implement and consume.
As explained above, I do not see how this would work except for the very special case of using expandtemplates to expand just a single template. This could be solved by introducing a new single-template mode for expandtemplates, e.g. using expand="Foo|x|y|z" instead of text="{{Foo|x|y|z}}".
Another way would be to use hints the structure returned by generatexml. There, we have an opportunity to declare a content type for a *part* of the output (or rather, input).
-- daniel
On 05/19/2014 04:52 AM, Daniel Kinzler wrote:
I'm getting the impression there is a fundamental misunderstanding here.
You are correct. I completely misunderstood what you said in your last response about expandtemplates. So, the rest of my response to your last email is irrelevant ... and let me reboot :-).
All that said, if you want to provide the wrapper with <html model="whatever" ....>fully-expanded-HTML</html>, we can handle that as well. We'll use the model attribute of the wrapper, discard the wrapper and use the contents in our pipeline.
Why use the model attribute? Why would you care about the original model? All you need to know is that you'll get HTML. Exposing the original model in this context seems useless if not misleading.
Given that I misunderstood your larger observation about expandtemplates, this is not relevant now. But, I was basing this on your proposal from the previous email which I'll now go back to.
On 05/17/2014 06:14 PM, Daniel Kinzler wrote:
I think something like <html transclusion="{{T}}" model="whatever">...</html> would work best.
I see what you are getting at here. Parsoid can treat this like a regular tag-extension and send it back to the api=parse endpoint for processing. Except if you provided the full expansion as the content of the html-wrapper, in which case the extra api call can be skipped. The extra api call is not really an issue for occasional uses, but on pages with a lot of non-wikitext transclusion uses, this is an extra api call for each such use. I don't have a sense for how common this would be, so maybe that is a premature worry.
That said, for other clients, this content would be deadweight (if they are going to discard it and go back to the api=parse endpoint anyway, or, worse, send the entire response back to a parser that is just going to discard it after the network transfer).
So, it looks like there are some conflicting perf. requirements for different clients wrt the expandtemplates response here. In that context, at least from a solely parsoid-centric point of view, the new api endpoint 'expand=Foo|x|y|z' you proposed would work well too.
Subbu.
A separate property in the JSON/XML structure avoids the need for escaping (and associated security risks if not done thoroughly), and should be relatively straightforward to implement and consume.
As explained above, I do not see how this would work except for the very special case of using expandtemplates to expand just a single template. This could be solved by introducing a new single-template mode for expandtemplates, e.g. using expand="Foo|x|y|z" instead of text="{{Foo|x|y|z}}".
Another way would be to use hints in the structure returned by generatexml. There, we have an opportunity to declare a content type for a *part* of the output (or rather, input).
Am 19.05.2014 14:21, schrieb Subramanya Sastry:
On 05/19/2014 04:52 AM, Daniel Kinzler wrote:
I'm getting the impression there is a fundamental misunderstanding here.
You are correct. I completely misunderstood what you said in your last response about expandtemplates. So, the rest of my response to your last email is irrelevant ... and let me reboot :-).
Glad we got that out of the way :)
On 05/17/2014 06:14 PM, Daniel Kinzler wrote:
I think something like <html transclusion="{{T}}" model="whatever">...</html> would work best.
I see what you are getting at here. Parsoid can treat this like a regular tag-extension and send it back to the api=parse endpoint for processing. Except if you provided the full expansion as the content of the html-wrapper, in which case the extra api call can be skipped. The extra api call is not really an issue for occasional uses, but on pages with a lot of non-wikitext transclusion uses, this is an extra api call for each such use. I don't have a sense for how common this would be, so maybe that is a premature worry.
I would probably go for always including the expanded HTML for now.
That said, for other clients, this content would be deadweight (if they are going to discard it and go back to the api=parse endpoint anyway, or, worse, send the entire response back to a parser that is just going to discard it after the network transfer).
Yes. There could be an option to omit it. That makes the implementation more complex, but it's doable.
So, it looks like there are some conflicting perf. requirements for different clients wrt the expandtemplates response here. In that context, at least from a solely parsoid-centric point of view, the new api endpoint 'expand=Foo|x|y|z' you proposed would work well too.
That seems the cleanest solution for the parsoid use case - however, the implementation is complicated by how parameter substitution works. For HTML-based transclusion, it doesn't work at all at the moment - we would need tighter integration with the preprocessor to do that.
Basically, there would be two cases: convert expand=Foo|x|y|z to {{Foo|x|y|z}} internally and call Parser::preprocess on that, so parameter substitution is done correctly; or get the HTML from Foo, and discard the parameters. We would have to somehow know in advance which mode to use, handle the appropriate case, and then set the Content-Type header accordingly. Pretty messy...
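That two-case dispatch could be sketched as follows (hypothetical function names throughout; Parser::preprocess and the content-model lookup are represented by callables passed in):

```python
def expand_single(title, params, get_content_model, preprocess, get_html):
    """Hypothetical handler for an expand=Foo|x|y|z mode."""
    if get_content_model(title) == "wikitext":
        # Case 1: rebuild {{Foo|x|y|z}} so parameter substitution is
        # done correctly by the preprocessor.
        call = "{{%s}}" % "|".join([title] + params)
        return "text/x-wiki", preprocess(call)
    # Case 2: HTML-based transclusion; parameters are currently discarded.
    return "text/html", get_html(title)

ctype, text = expand_single(
    "Foo", ["x", "y"],
    get_content_model=lambda t: "wikitext",
    preprocess=lambda s: s,          # stand-in for Parser::preprocess
    get_html=lambda t: "<p>...</p>")
```

The messiness Daniel mentions is visible even in the sketch: the caller cannot know the Content-Type of the response until the content model of the target page has been looked up.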
I think <html transclusion="{{T}}"> is the simplest and most robust solution for now.
-- daniel
On 05/19/2014 09:52 AM, Daniel Kinzler wrote:
Am 18.05.2014 16:29, schrieb Gabriel Wicke:
The difference between wrapper and property is actually that using inline wrappers in the returned wikitext would force us to escape similar wrappers from normal template content to avoid opening a gaping XSS hole.
Please explain, I do not see the hole you mention.
If the input contained <html>evil stuff</html>, it would just get escaped by the preprocessor (unless $wgRawHtml is enabled), as it is now: https://de.wikipedia.org/w/api.php?action=expandtemplates&text=%3Chtml%3...
What you see there is just unescaped HTML embedded in the XML result format. It's clearer that there's in fact no escaping on the HTML when looking at the JSON:
https://de.wikipedia.org/w/api.php?action=expandtemplates&text=%3Chtml%3...
Parsoid depends on there being no escaping for unknown tags (and known extension tags) in the preprocessor.
So if you use tags, you'll have to add escaping for those.
The move to HTML-based (self-contained) transclusion expansions will avoid this issue completely. That's a few months out though. Maybe we can find a stop-gap solution that moves in that direction, without introducing special tags in expandtemplates that we'll have to support for a long time.
Gabriel
On 05/19/2014 04:54 PM, Gabriel Wicke wrote:
The move to HTML-based (self-contained) transclusion expansions will avoid this issue completely. That's a few months out though. Maybe we can find a stop-gap solution that moves in that direction, without introducing special tags in expandtemplates that we'll have to support for a long time.
Here's a proposal:
* Introduce a <domparse> extension tag that causes its content to be parsed all the way to a self-contained DOM structure. Example: <domparse>{{T}}</domparse>
* Emit this tag for HTML page transclusions. Avoids the security issue as there's no way to inject verbatim HTML. Works with Parsoid out of the box.
* Use <domparse> to support parsing unbalanced templates by inserting it into wikitext: <domparse> {{table-start}} {{table-row}} {{table-end}} </domparse>
* Build a solid HTML-only expansion API end point, and start using that for all transclusions that are not wrapped in <domparse>
* Stop wrapping non-wikitext transclusions into <domparse> in action=expandtemplates once those can be directly expanded to a self-contained DOM.
Gabriel
On 05/19/2014 10:19 AM, Gabriel Wicke wrote:
On 05/19/2014 04:54 PM, Gabriel Wicke wrote:
The move to HTML-based (self-contained) transclusion expansions will avoid this issue completely. That's a few months out though. Maybe we can find a stop-gap solution that moves in that direction, without introducing special tags in expandtemplates that we'll have to support for a long time.
Here's a proposal:
- Introduce a <domparse> extension tag that causes its content to be parsed
all the way to a self-contained DOM structure. Example: <domparse>{{T}}</domparse>
- Emit this tag for HTML page transclusions. Avoids the security issue as
there's no way to inject verbatim HTML. Works with Parsoid out of the box.
- Use <domparse> to support parsing unbalanced templates by inserting it
into wikitext:
<domparse> {{table-start}} {{table-row}} {{table-end}} </domparse>
- Build a solid HTML-only expansion API end point, and start using that for
all transclusions that are not wrapped in <domparse>
- Stop wrapping non-wikitext transclusions into <domparse> in
action=expandtemplates once those can be directly expanded to a self-contained DOM.
Here's a possible division of labor:
You (Daniel) could start with the second step (emitting the tag). Since not much escaping is needed (only nested <domparse> tags in the transclusion) this should be fairly straightforward.
We could work on the extension implementation (first bullet point) together, or tackle it completely on the Parsoid side. We planned to work on this in any case as part of our longer-term migration to well-balanced HTML transclusions.
The advantage of using <domparse> to support both unbalanced templates & special transclusions is that we'll only have to implement this once, and won't introduce another tag only to deprecate it fairly quickly. Phasing out unbalanced templates will take longer, as we'll first have to come up with alternative means to support the same use cases.
Gabriel
I am kind of lost in this discussion, but let me just ask one question.
Won't all of the proposed solutions, other than the one of just not expanding transclusions that can't be expanded to wikitext, break the original and primary purpose of ExpandTemplates: providing valid parsable wikitext, for understanding by humans and for pasting back into articles in order to bypass transclusion limits?
I feel that Parsoid should be using a separate API for whatever it's doing with the wikitext. I'm sure that would give you more flexibility with internal design as well.
On 05/19/2014 10:55 AM, Bartosz Dziewoński wrote:
I am kind of lost in this discussion, but let me just ask one question.
Won't all of the proposed solutions, other than the one of just not expanding transclusions that can't be expanded to wikitext, break the original and primary purpose of ExpandTemplates: providing valid parsable wikitext, for understanding by humans and for pasting back into articles in order to bypass transclusion limits?
Yup. But that's the case with <domparse>, while it's not the case with <html> unless $wgRawHtml is true (which is impossible for publicly-editable wikis).
I feel that Parsoid should be using a separate API for whatever it's doing with the wikitext. I'm sure that would give you more flexibility with internal design as well.
We are moving towards that, but will still need to support unbalanced transclusions for a while. Since special transclusions can be nested inside of those we will need some form of inline support even if we expand most transclusions all the way to DOM with a different end point. Also, as Daniel pointed out, most other users are using action=expandtemplates for entire pages and expect that to work as well.
Gabriel
Am 19.05.2014 20:01, schrieb Gabriel Wicke:
On 05/19/2014 10:55 AM, Bartosz Dziewoński wrote:
I am kind of lost in this discussion, but let me just ask one question.
Won't all of the proposed solutions, other than the one of just not expanding transclusions that can't be expanded to wikitext, break the original and primary purpose of ExpandTemplates: providing valid parsable wikitext, for understanding by humans and for pasting back into articles in order to bypass transclusion limits?
Yup. But that's the case with <domparse>, while it's not the case with
<html> unless $wgRawHtml is true (which is impossible for publicly-editable wikis).
<html transclusion="{{T}}"> would work transparently. It would contain HTML, for direct use by the client, and could be passed back to the parser, which would ignore the HTML and execute the transclusion. It should be 100% compatible with existing clients (unless they look for verbatim "<html>" for some reason).
I'll have to re-read Gabriel's <domparse> proposal tomorrow - right now, I don't see why it would be necessary, or how it would improve the situation.
I feel that Parsoid should be using a separate API for whatever it's doing with the wikitext. I'm sure that would give you more flexibility with internal design as well.
We are moving towards that, but will still need to support unbalanced transclusions for a while.
But for HTML based transclusions you could ignore that - you could already resolve these using a separate API call, if needed.
But still - I do not see why that would be necessary. If expandtemplates returns <html transclusion="{{T}}">, clients can pass that back to the parser safely, or use the contained HTML directly, safely.
Parsoid would keep working as before: it would treat <html> as a tag extension (it does that, right?) and pass it back to the parser (which would expand it again, this time fully, if action=parse is used). If parsoid knows about the special properties of <html>, it could just use the contents verbatim - I see no reason why that would be any more unsafe than any other HTML returned by the parser.
But perhaps I'm missing something obvious. I'll re-read the proposal tomorrow.
-- daniel
On 05/19/2014 12:46 PM, Daniel Kinzler wrote:
Am 19.05.2014 20:01, schrieb Gabriel Wicke:
On 05/19/2014 10:55 AM, Bartosz Dziewoński wrote:
I am kind of lost in this discussion, but let me just ask one question.
Won't all of the proposed solutions, other than the one of just not expanding transclusions that can't be expanded to wikitext, break the original and primary purpose of ExpandTemplates: providing valid parsable wikitext, for understanding by humans and for pasting back into articles in order to bypass transclusion limits?
Yup. But that's the case with <domparse>, while it's not the case with
<html> unless $wgRawHtml is true (which is impossible for publicly-editable wikis).
<html transclusion="{{T}}"> would work transparently. It would contain HTML, for direct use by the client, and could be passed back to the parser, which would ignore the HTML and execute the transclusion. It should be 100% compatible with existing clients (unless they look for verbatim "<html>" for some reason).
Currently <html> tags are escaped when $wgRawHtml is disabled. We could change the implementation to stop doing so *iff* the transclusion parameter is supplied, but IMO that would be fairly unexpected and inconsistent behavior.
I feel that Parsoid should be using a separate API for whatever it's doing with the wikitext. I'm sure that would give you more flexibility with internal design as well.
We are moving towards that, but will still need to support unbalanced transclusions for a while.
But for HTML based transclusions you could ignore that - you could already resolve these using a separate API call, if needed.
Yes, and they are going to be the common case once we have marked up the exceptions with tags like <domparse>. As you correctly pointed out, inline tags are primarily needed for expandtemplates calls on compound content, which we need to do as long as we support unbalanced templates. We can't know a priori whether some transclusions in turn transclude special HTML content.
I think we have agreement that some kind of tag is still needed. The main point still under discussion is on which tag to use, and how to implement this tag in the parser.
Originally, <domparse> was conceived to be used in actual page content to wrap wikitext that is supposed to be parsed to a balanced DOM *as a unit* rather than transclusion by transclusion. Once unbalanced compound transclusion content is wrapped in <domparse> tags (manually or via bots using Parsoid info), we can start to enforce nesting of all other transclusions by default. This will make editing safer and more accurate, and improve performance by letting us reuse expansions and avoid re-rendering the entire page during refreshLinks. See https://bugzilla.wikimedia.org/show_bug.cgi?id=55524 for more background.
The use of <domparse> to mark up special HTML transclusions in expandtemplates output will be temporary (until HTML transclusions are the default), but even if such output is pasted into the actual wikitext it would be harmless, and would work as expected.
Now back to the syntax. Encoding complex transclusions in a HTML parameter would be rather cumbersome, and would entail a lot of attribute-specific escaping. Wrapping such transclusions in <domparse> tags on the other hand normally does not entail any escaping, as only nested <domparse> tags are problematic.
Parsoid would keep working as before: it would treat <html> as a tag extension (it does that, right?)
$wgRawHtml is disabled in all wikis we are currently interested in. MediaWiki does properly report the <html> extension tag from siteinfo when $wgRawHtml is enabled, so it ought to work with Parsoid for private wikis. It will be harder to support the <html transclusion="<transclusions>"></html> exception.
Gabriel
Am 19.05.2014 23:05, schrieb Gabriel Wicke:
I think we have agreement that some kind of tag is still needed. The main point still under discussion is on which tag to use, and how to implement this tag in the parser.
Indeed.
Originally, <domparse> was conceived to be used in actual page content to wrap wikitext that is supposed to be parsed to a balanced DOM *as a unit* rather than transclusion by transclusion. Once unbalanced compound transclusion content is wrapped in <domparse> tags (manually or via bots using Parsoid info), we can start to enforce nesting of all other transclusions by default. This will make editing safer and more accurate, and improve performance by letting us reuse expansions and avoid re-rendering the entire page during refreshLinks. See https://bugzilla.wikimedia.org/show_bug.cgi?id=55524 for more background.
Ah, I thought you just pulled that out of your hat :)
My main reason for recycling the <html> tag was to not introduce a new tag extension. <domparse> may occur verbatim in existing wikitext, and would break when the tag is introduced.
Other than that, I'm fine with outputting whatever tag you like for the transclusion. Implementing the tag is something else, though - I could implement it so it will work for HTML transclusion, but I'm not sure I understand the original domparse stuff well enough to get that right. Would domparse be in core, btw?
Now back to the syntax. Encoding complex transclusions in a HTML parameter would be rather cumbersome, and would entail a lot of attribute-specific escaping.
Why would it involve any escaping? It should be handled as a tag extension, like any other.
$wgRawHtml is disabled in all wikis we are currently interested in. MediaWiki does properly report the <html> extension tag from siteinfo when $wgRawHtml is enabled, so it ought to work with Parsoid for private wikis. It will be harder to support the <html transclusion="<transclusions>"></html> exception.
I should try what expandtemplates does with <html> with $wgRawHtml enabled. Nothing, probably. It will just come back containing raw HTML. Which would be fine, I think.
By the way: once we agree on a mechanism, it would be trivial to use the same mechanism for special page transclusion. My patch actually already covers that. Do you agree that this is the Right Thing? It's just transclusion of HTML content, after all.
-- daniel
On 05/20/2014 02:46 AM, Daniel Kinzler wrote:
My main reason for recycling the <html> tag was to not introduce a new tag extension. <domparse> may occur verbatim in existing wikitext, and would break when the tag is introduced.
The only existing mentions of this are probably us discussing it ;) In any case, it's easy to grep for it & nowikify existing uses.
Other than that, I'm fine with outputting whatever tag you like for the transclusion.
Great!
Implementing the tag is something else, though - I could implement it so it will work for HTML transclusion, but I'm not sure I understand the original domparse stuff well enough to get that right. Would domparse be in core, btw?
Yes, it should be in core. I believe that a very simple implementation (without actual DOM balancing, using Parser::recursiveTagParse()) would not be too hard. The guts of it are described in [1]. The limitations of recursiveTagParse should not matter much for this use case.
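A toy model of that simple implementation (Python for illustration only; the real hook would be PHP, registered via Parser::setHook, with Parser::recursiveTagParse doing the actual work):

```python
def domparse_hook(inner_wikitext, recursive_tag_parse):
    """Toy <domparse> handler: hand the wrapped wikitext to the parser as a
    single unit, so its transclusions are expanded together. No actual DOM
    balancing is attempted here."""
    return recursive_tag_parse(inner_wikitext)

def toy_recursive_tag_parse(text):
    # Stand-in for Parser::recursiveTagParse(): expands two fake templates
    # whose individual outputs would be unbalanced on their own.
    return text.replace("{{table-start}}", "<table>").replace("{{table-end}}", "</table>")

html = domparse_hook("{{table-start}}{{table-end}}", toy_recursive_tag_parse)
```

The unbalanced fragments only become well-formed because they are parsed together inside the tag body, which is the whole point of the extension.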
Now back to the syntax. Encoding complex transclusions in a HTML parameter would be rather cumbersome, and would entail a lot of attribute-specific escaping.
Why would it involve any escaping? It should be handled as a tag extension, like any other.
Transclusions can contain quotes, which need to be escaped in attribute values to make sure that the attribute is in fact an attribute. Since quotes tend to be more common than <domparse> tags this means that there's going to be more escaping. I also find it harder to scan for quotes ending a long attribute value. Tags are easier to spot.
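To illustrate the difference (using Python's stdlib html.escape as a stand-in for whatever escaping the serializer would actually apply):

```python
from html import escape

transclusion = '{{Infobox|caption="a quoted value"}}'

# Attribute form: every quote inside the transclusion must be escaped so
# the attribute value stays well-formed.
attr_form = '<html transclusion="%s"></html>' % escape(transclusion, quote=True)

# Tag-wrapping form: no escaping needed unless the content itself contains
# a nested <domparse> tag.
tag_form = "<domparse>%s</domparse>" % transclusion
```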
$wgRawHtml is disabled in all wikis we are currently interested in. MediaWiki does properly report the <html> extension tag from siteinfo when $wgRawHtml is enabled, so it ought to work with Parsoid for private wikis. It will be harder to support the <html transclusion="<transclusions>"></html> exception.
I should try what expandtemplates does with <html> with $wgRawHtml enabled. Nothing, probably. It will just come back containing raw HTML. Which would be fine, I think.
Yes, that case will work. But $wgRawHtml enabled is the exception, and not something I'd like to encourage.
By the way: once we agree on a mechanism, it would be trivial to use the same mechanism for special page transclusion. My patch actually already covers that. Do you agree that this is the Right Thing? It's just transclusion of HTML content, after all.
Yes, that sounds good to me.
Gabriel
[1]: https://www.mediawiki.org/wiki/Manual:Tag_extensions#How_do_I_render_wikitex...
Can you outline how RL modules would be handled in the transclusion scenario?
The current patch does not really address that problem, I'm afraid. I can think of two solutions:
- Create a SyntheticHtmlContent class that would hold meta info about
modules etc, just like ParserOutput - perhaps it would just contain a ParserOutput object. And an equivalent SyntheticWikitextContent class, perhaps. That would allow us to pass such meta-info around as needed.
- Move the entire logic for HTML based transclusion into the wikitext
parser, where it can just call getParserOutput() on the respective Content object. We would then no longer need the generic infrastructure for HTML transclusion. Maybe that would be a better solution in the end.
Hm... yes, I should make an alternative patch using that approach, so we can compare.
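The first option could look roughly like this (class, field, and module names are all hypothetical, loosely modeled on ParserOutput):

```python
from dataclasses import dataclass, field

@dataclass
class SyntheticHtmlContent:
    """Hypothetical content object: rendered HTML plus the page metadata
    (e.g. ResourceLoader modules) that the transcluding page must inherit."""
    html: str
    modules: list = field(default_factory=list)

def transclude(target_output, content):
    # Merge the transcluded content's module requirements into the top-level
    # parser output, then return the HTML for the strip mark.
    target_output["modules"].extend(content.modules)
    return content.html

page_output = {"modules": []}
diagram = SyntheticHtmlContent('<div class="vega"></div>', modules=["ext.limn.vega"])
snippet = transclude(page_output, diagram)
```

Either way, the essential requirement is the same: module registrations must survive the trip from the transcluded content's rendering into the target page's ParserOutput.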
Thanks a lot Daniel, I'm happy to help test / try out any solutions you want to experiment with. I've moved my work to gerrit: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/Limn and the last commit (with a lot of help from Matt F.) may be ready for you to use as a use case. Let me know if it'd be helpful to install this somewhere in labs.
On 05/13/2014 11:37 AM, Daniel Kinzler wrote:
Hi all!
During the hackathon, I worked on a patch that would make it possible for non-textual content to be included on wikitext pages using the template syntax. The idea is that if we have a content handler that e.g. generates awesome diagrams from JSON data, like the extension Dan Andreescu wrote, we want to be able to use that output on a wiki page. But until now, that would have required the content handler to generate wikitext for the transclusion - not easily done.
From working with Dan on this, the main issue is the ResourceLoader module that the diagrams require (it uses a JavaScript library called Vega, plus a couple supporting libraries, and simple MW setup code).
The container element that it needs can be as simple as:
<div data-something="..."></div>
which is actually valid wikitext.
Can you outline how RL modules would be handled in the transclusion scenario?
Matt Flaschen
On 05/13/2014 05:37 PM, Daniel Kinzler wrote:
Hi all!
During the hackathon, I worked on a patch that would make it possible for non-textual content to be included on wikitext pages using the template syntax. The idea is that if we have a content handler that e.g. generates awesome diagrams from JSON data, like the extension Dan Andreescu wrote, we want to be able to use that output on a wiki page. But until now, that would have required the content handler to generate wikitext for the transclusion - not easily done.
It sounds like this won't work well with current Parsoid. We are using action=expandtemplates for the preprocessing of transclusions, and then parse the contents using Parsoid. The content is finally passed through the sanitizer to keep XSS at bay.
This means that HTML returned from the preprocessor needs to be valid in wikitext to avoid being stripped out by the sanitizer. Maybe that's actually possible, but my impression is that you are shooting for something that's closer to the behavior of a tag extension. Those already bypass the sanitizer, so would be less troublesome in the short term. We currently also can't process transclusions independently to HTML, as we still have to support unbalanced templates. We are moving into that direction though, which should also make it easier to support non-wikitext transclusion content.
In the longer term, Parsoid will request pre-sanitized and balanced HTML from the content API [1,2] for everything but unbalanced wikitext content [3]. The content API will treat it like any other request, and ask the storage service for the HTML. If that's found, then it is directly returned and no rendering happens. This is going to be the typical and fast case. If there is however no HTML in storage for that revision the content API will just call the renderer service and save the HTML back / return it to clients like Parsoid.
So it is important to think of renderers as services, so that they are usable from the content API and Parsoid. For existing PHP code this could even be action=parse, but for new renderers without a need or desire to tie themselves to MediaWiki internals I'd recommend to think of them as their own service. This can also make them more attractive to third party contributors from outside the MediaWiki world, as has for example recently happened with Mathoid.
Gabriel
[1]: https://www.mediawiki.org/wiki/Requests_for_comment/Content_API
[2]: https://github.com/gwicke/restface
[3]: We are currently mentoring a GSoC project to collect statistics on issues like unbalanced templates, which should allow us to systematically mark those transclusions by wrapping them in a <domparse> tag in wikitext. All transclusions outside of <domparse> will then be expected to yield stand-alone HTML.
Hi again!
I have rewritten the patch that enabled HTML based transclusion:
https://gerrit.wikimedia.org/r/#/c/132710/
I tried to address the concerns raised about my previous attempt, namely, how HTML based transclusion is handled in expandtemplates, and how page meta data such as resource modules get passed from the transcluded content to the main parser output (this should work now).
For expandtemplates, I decided to just keep HTML based transclusions as they are - including special page transclusions. So, expandtemplates will simply leave {{Special:Foo}} and {{MediaWiki:Foo.js}} in the expanded text, while in the xml output, you can still see them as template calls.
Cheers, Daniel
wikitech-l@lists.wikimedia.org