An interesting idea just popped into my head, a combination of my explorations through the DOM preprocessor and my attempt at deferring editsection replacement until after parsing is done. The point of that deferral is to let skins modify the markup used in an editsection link in a skin-specific way without breaking things, and to let us stop fragmenting the parser cache by user language just for edit section links.
A postprocessor. It would be quite interesting if, instead of HTML, we started outputting something like this in our parser output:

  <root>
    <html><p>foo</p><h2></html>
    <editsection page="Foo" section="1">bar</editsection>
    <html>bar</h2><p>baz</p><h2></html>
    <choose>
      <option><html><p>foo</p></html></option>
      <option><html><p>bar</p></html></option>
      <option><html><p>baz</p></html></option>
    </choose>
  </root>

(Don't get scared off by all the entities; this is nothing new. Try looking at a preprocess-xml cache entry.)
Of course, this is a Postprocessor_DOM-oriented view of it. As with Preprocessor_Hash, we'd also have a Postprocessor_Hash that stores a different format, as we already do with Preprocessor_Hash (serialized?).
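Purely as an illustration of that, a Postprocessor_Hash might cache the same tree as a serializable PHP array instead of an XML document. None of this structure exists anywhere; it's just a sketch of the shape:

// Illustrative only: the same tree in a serializable array form,
// the way a hypothetical Postprocessor_Hash might store it.
$tree = [ 'root', [
	[ 'html', '<p>foo</p><h2>' ],
	[ 'editsection', [ 'page' => 'Foo', 'section' => '1' ], 'bar' ],
	[ 'html', 'bar</h2><p>baz</p><h2>' ],
	[ 'choose', [
		[ 'option', [ [ 'html', '<p>foo</p>' ] ] ],
		[ 'option', [ [ 'html', '<p>bar</p>' ] ] ],
		[ 'option', [ [ 'html', '<p>baz</p>' ] ] ],
	] ],
] ];
// serialize( $tree ) is what would actually land in the cache.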
The idea is the creation of new markers that aren't 100% parsed, but are output in an easy-to-deserialize format that can be finished with minimal work, and that extensions can emit and have a postprocessor hook expand later on. In essence the idea here is twofold.

Firstly, things like the <editsection page="Foo" section="1">bar</editsection> I tried to introduce would no longer be a hack. Secondly, we could start deferring minimal-processing-cost things that currently fragment the parser cache when they don't need to. Ideally, in the future, if something like {{int:asdf}} isn't used in a [[]] or in a parser function, and is just a base-level bit of display isolated from the rest of the WikiText, we might be able to output it in a way that doesn't fragment the cache by user language but still renders the message in the user's language by deferring it.

And as a big extra bonus, think of the RandomSelection extension. Right now, extensions like RandomSelection end up disabling the entire parser cache for a page just so they can output a random one of a series of options. With a postprocessor they could instead craft partially parsed output where all the normal wikitext is still parsed, but every option given in the source text is included, and the postprocessor handles the actual random selection on each page view, outputting only one of the html nodes. Likewise, we might be able to implement "Welcome {{USERNAME}}!" without fragmenting the cache by user or having to disable it.
The key is that we get things as variable as complete randomness, at the level of re-executing that randomness on each page view, yet with barely any more processing to do than we had before (like the rest of the UI that isn't part of the page content).
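To make that concrete, here is a minimal sketch of what a Postprocessor_DOM pass over the output above could look like. Everything here is hypothetical: the <editsection> and <choose>/<option> markers are the ones invented in the example, editSectionLink() is a placeholder, and it assumes (as with preprocessor cache entries) that finished HTML is stored entity-escaped as text inside the <html> elements. The only real APIs used are PHP's DOMDocument and mt_rand().

class Postprocessor_DOM {
	/**
	 * Expand deferred markers in cached parser output into final HTML.
	 * Runs on every page view; already-parsed HTML passes straight through.
	 */
	public function postprocess( $xml, $userLang ) {
		$dom = new DOMDocument();
		$dom->loadXML( $xml );

		$out = '';
		foreach ( $dom->documentElement->childNodes as $node ) {
			switch ( $node->nodeName ) {
				case 'html':
					// Fully parsed HTML, stored entity-escaped: emit untouched.
					$out .= $node->textContent;
					break;
				case 'editsection':
					// Deferred so the link can be built per-skin and in the
					// viewing user's language without fragmenting the cache.
					$out .= $this->editSectionLink(
						$node->getAttribute( 'page' ),
						$node->getAttribute( 'section' ),
						$userLang
					);
					break;
				case 'choose':
					// RandomSelection case: every <option> was fully parsed
					// and cached; pick one fresh on each view.
					$options = $node->getElementsByTagName( 'option' );
					if ( $options->length > 0 ) {
						$pick = $options->item( mt_rand( 0, $options->length - 1 ) );
						$out .= $pick->textContent;
					}
					break;
			}
		}
		return $out;
	}

	private function editSectionLink( $page, $section, $userLang ) {
		// Placeholder: a real implementation would ask the skin for its
		// own markup here, localized to $userLang.
		return '<span class="editsection">[<a href="/index.php?title=' .
			htmlspecialchars( $page ) . '&action=edit&section=' .
			htmlspecialchars( $section ) . '">edit</a>]</span>';
	}
}

A Postprocessor_Hash variant would do the same walk over the serialized array form instead of a DOM.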
Adding yet another discrete parsing step is the reverse of the direction a lot of people hoping to clean up wikitext are heading.
What some of us have been kicking around would be migrating away from preprocessing the text at all. Instead, the text would be parsed in a single step into an intermediate structure that is neither wikitext nor HTML. Templates would be required to return whole structures when expanded (open what you close, close what you open) and would only be permitted in sanitary places (not in the middle of wiki or HTML syntax, for instance).
Once the document is in this intermediate structure, it would still contain enough information about where it came from to make a round trip without a dirty diff. Alternatively (and more usefully) template elements would be expanded, and the resulting structure would be renderable into a variety of output formats such as HTML, PDF, a lightweight version of HTML (for mobile devices), or even plain text.
Because the rendered output can be much more configurable, structured and regular than our current output, it would be more reasonable to perform additional transformations on it if a skin needed to.
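Purely for illustration (this is not a settled format), such an intermediate structure might look something like the sketch below, with each node carrying the source offsets it came from so the original wikitext can be regenerated without a dirty diff:

// Hypothetical node shapes; 'src' holds [start, end] offsets into the
// original wikitext, which is what makes a clean round trip possible.
$doc = [
	[ 'type' => 'heading', 'level' => 2, 'src' => [ 0, 10 ],
		'content' => [ [ 'type' => 'text', 'value' => 'Foo' ] ] ],
	// Templates expand to whole, well-formed subtrees ("open what you
	// close, close what you open"), so they render or skip as a unit.
	[ 'type' => 'template', 'name' => 'Infobox', 'src' => [ 11, 58 ],
		'args' => [ 'name' => 'Bar' ] ],
	[ 'type' => 'paragraph', 'src' => [ 59, 120 ],
		'content' => [ [ 'type' => 'text', 'value' => 'baz' ] ] ],
];
// The same tree could then feed an HTML, PDF, mobile-HTML, or
// plain-text renderer.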
- Trevor
Usually I don't reply that much on other threads, but for this: +1.
Jan Paul
On Mon, Jan 31, 2011 at 4:55 PM, Trevor Parscal tparscal@wikimedia.org wrote:
Adding yet another discrete parsing step is the reverse of the direction a lot of people hoping to clean up wikitext are heading.
What system do you propose that would retain the performance benefits of this suggestion and be deployable in the near future? A simple postprocessor would be very useful: you could save greatly on parser cache fragmentation, if not eliminate it entirely. E.g., as Daniel notes, you could leave a marker in the parser output where section links should go and have the postprocessor fill it in depending on user language, so we don't fragment the cache by language. More importantly, a postprocessor would allow us to add new features that are currently unacceptable due to cache fragmentation.
What some of us have been kicking around would be migrating away from preprocessing the text at all. Instead, the text would be parsed in a single step into an intermediate structure that is neither wikitext nor HTML. Templates would be required to return whole structures when expanded (open what you close, close what you open) and would only be permitted in sanitary places (not in the middle of wiki or HTML syntax, for instance).
This is possibly a good long-term goal, but I don't see how it conflicts with a postprocessing step at all. As long as parsing large pages requires significant CPU time, we'll want to cache the parsed output as much as possible, and a postprocessor will always help to reduce cache fragmentation. If we ever do move to a storage format that's so fast to process that we don't care about cache misses, of course, we could scrap the postprocessor and incorporate its effects into the main pass, no harm done.
Daniel Friesen wrote:
An interesting idea just popped into my head, a combination of my explorations through the DOM preprocessor and my attempt at deferring editsection replacement until after parsing is done. The point of that deferral is to let skins modify the markup used in an editsection link in a skin-specific way without breaking things, and to let us stop fragmenting the parser cache by user language just for edit section links. A postprocessor.
You're approaching the dark side, Luke. :)
Secondly, we could start deferring minimal-processing-cost things that currently fragment the parser cache when they don't need to. Ideally, in the future, if something like {{int:asdf}} isn't used in a [[]] or in a parser function, and is just a base-level bit of display isolated from the rest of the WikiText, we might be able to output it in a way that doesn't fragment the cache by user language but still renders the message in the user's language by deferring it.
{{int: }} inside links corrupting tables is solved in 1.17. {{int:}} inside a non-taken branch is fixed, too.
I have been thinking for some time about adding a postprocessing step for stub links, though.
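Stub links are a natural fit, since the stub threshold is a per-user preference. A hypothetical sketch, assuming the parser emitted <wikilink page="Foo" size="1234">Foo</wikilink> markers (no such marker exists today) and reusing the DOM form from above:

function expandStubLinks( DOMDocument $dom, $stubThreshold ) {
	// Copy the live node list first, since we replace nodes as we go.
	foreach ( iterator_to_array( $dom->getElementsByTagName( 'wikilink' ) ) as $node ) {
		$a = $dom->createElement( 'a' );
		$a->appendChild( $dom->createTextNode( $node->textContent ) );
		$a->setAttribute( 'href', '/wiki/' . rawurlencode( $node->getAttribute( 'page' ) ) );
		if ( (int)$node->getAttribute( 'size' ) < $stubThreshold ) {
			// Applied here, at view time, instead of fragmenting the
			// parser cache by each user's stub threshold preference.
			$a->setAttribute( 'class', 'stub' );
		}
		$node->parentNode->replaceChild( $a, $node );
	}
}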