Hello,
Recently I have been researching Parsoid's design, since MediaWiki is migrating to Parsoid. I found that, due to its single-pass tokenizing design, templates are not handled textually the way the legacy parser handles them.
This is good, as the HTML now carries information about which template a given piece of content is transcluded from. However, https://www.mediawiki.org/wiki/Parsoid/limitations says "We have since decided to use the PHP preprocessor for template expansions, which side-steps these issues by reverting to the traditional textual preprocessor pass". Is this still true today?
Best regards, Diskdance
Our primary goal with Parsoid today is to ensure maximum compatibility with the current default parser -- without that, it would be impossible to switch over to Parsoid for all page rendering use cases.
But, at its core, Parsoid's design has always pursued a processing model where content (fragment) generators (whether templates, extensions, parser functions, or, in the future, wiki functions or other page components) are decoupled from the page where they are embedded. This lets us process them independently and incorporate the generated fragments efficiently. Parsoid already uses this model for extensions. But that model hasn't held up for templates as they are implemented today, because of how they are used and what they generate (snippets of text that can be full or partial attributes, a mix of attributes and content, parts of tables) -- the table use cases being the most egregious of those.
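To make that contract concrete, here is a minimal TypeScript sketch (hypothetical names, not Parsoid's actual interfaces; assumes a browser-style DOM environment) of what the decoupled model expects from a generator, and why typical template output can't satisfy it:

// Hypothetical sketch of the decoupled fragment-generator contract.
interface FragmentGenerator {
  // Expand this generator's arguments into a self-contained DOM fragment
  // that the page pipeline can splice in without re-tokenizing the page.
  expand(args: Map<string, string>): DocumentFragment;
}

// Extensions fit the contract: their output parses to a complete subtree.
class ExtensionTag implements FragmentGenerator {
  constructor(private render: (args: Map<string, string>) => string) {}

  expand(args: Map<string, string>): DocumentFragment {
    const tpl = document.createElement('template');
    tpl.innerHTML = this.render(args);
    return tpl.content;
  }
}

// A template that emits `class="wikitable"` or `| cell ...` produces a
// partial attribute or a table fragment -- not a well-formed subtree --
// so it cannot implement expand() without knowing its surrounding context.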
So, given these practical realities, the simplest course of action for handling templates today is to have them fully expanded as textual strings and then do additional processing within Parsoid. But Parsoid is still able to clearly demarcate page content that comes from templates (and other content generators), even where template content combines with page-level content in complex ways (some of them caused by markup errors in table content that trigger content fostering -- a source of unnecessary complexity and headaches for us).
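For illustration: Parsoid's output HTML marks the first node of a transclusion with typeof="mw:Transclusion" and gives all nodes produced by the same transclusion a shared "about" id. A small hypothetical TypeScript helper (again assuming a DOM environment) can then recover which nodes came from a given transclusion, even when they interleave with page-level content:

// Collect all nodes emitted by one transclusion in Parsoid-style HTML.
// The helper is hypothetical; the typeof/about conventions are Parsoid's.
function transclusionNodes(doc: Document, aboutId: string): Element[] {
  // The first node carries typeof="mw:Transclusion"; later nodes from the
  // same transclusion carry only the matching about attribute.
  return Array.from(doc.querySelectorAll(`[about="${aboutId}"]`));
}

// Usage against a tiny Parsoid-style snippet:
const doc = new DOMParser().parseFromString(
  `<p about="#mwt1" typeof="mw:Transclusion">from a template</p>
   <p>page-level content</p>
   <p about="#mwt1">also from that template</p>`,
  'text/html'
);
console.log(transclusionNodes(doc, '#mwt1').length); // 2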
Our goal is to start moving towards the original decoupled processing model for templates as well, but only after we are able to switch over to Parsoid more fully, and that is looking closer than ever at this point. But that is going to be a gradual evolution -- there are various proposals we have considered in the past here, and typing is probably the overarching concept that ties all those ideas together.
Hope that answers your primary question. Some additional tangential details below while I am at it.
<tangent>All that said, I wouldn't invest too much time analyzing the contents of that page and the notions of single-pass vs. multi-pass, PEG vs. not-PEG, etc. Those are somewhat immaterial implementation details. I am not sure I would describe Parsoid as a single-pass model today. It is single-pass only insofar as it processes the textual string in one pass. Otherwise, the generated tokens are processed multiple times as they are transformed, and the DOM that is built up is processed multiple times ... so, if anything, Parsoid has a lot more (20+) passes. Separately, given that we cannot really process the wikitext stream into a fully processed semantic tree (because of the nature of wikitext), we could have used other ways of generating tokens, along with corresponding token transformers, to get the same end result. Since it is mostly water under the bridge now, we haven't really investigated how this might have looked if we had used traditional LALR techniques (as long as we realize the output of that grammar would just be a different set of tokens, not a conventional AST). I mention this tangent mostly to emphasize that our goal here is not to arrive at a formal (implementation) grammar in the traditional programming-language sense, but rather to transition to a different (decoupled / typed) processing model while preserving compatibility in the interim and while giving us feasible migration paths to that model.</tangent>
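To make the "many passes" point concrete, a toy TypeScript sketch (invented types, nothing like Parsoid's real internals) of one tokenizing pass over the string followed by repeated transform passes over the token stream:

// Toy model: a single pass turns the string into tokens; every transformer
// then walks the full token stream again. Parsoid additionally runs many
// passes over the DOM it builds from these tokens.
type Token = { type: 'text' | 'quote'; value: string };
type TokenTransform = (tokens: Token[]) => Token[];

function tokenize(src: string): Token[] {
  // One pass over the raw string; the capture group keeps the separators.
  return src.split(/('{2,})/).filter(Boolean).map((value): Token => ({
    type: value.startsWith("''") ? 'quote' : 'text',
    value,
  }));
}

function run(src: string, passes: TokenTransform[]): Token[] {
  return passes.reduce((tokens, pass) => pass(tokens), tokenize(src));
}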
Subbu.