I pulled out most of the parser-y parts from the parserTests, leaving behind just tests.
However, the parser is still a bit of a monster object, hence the deliberately silly name, ParserThingy.
I'm trying to decompose it into a chain, roughly like:
1. wikiText -> tokenize -> tokens
2. tokens -> treeBuilder -> dom-tree
3. dom-tree -> serialize -> wikiDom or HTML
There's a "postprocess" step as well, but I think that makes sense as part of step 2.
Each step should be individually testable. And the whole enchilada is like a big composition of initialized objects, e.g.
wikiTextToWikiDom = new wikiTextToSerialization(
    new wikiTextTokenizer( tokenConfig ),
    new domTreeBuilder( treeConfig ),
    new domTreeToWikiDom( serializationConfig )
);
var wikiDom = wikiTextToWikiDom( wikitext );
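To make the shape of that composition concrete, here is a minimal synchronous sketch. The stage factories (makeTokenizer, makeTreeBuilder, makeSerializer) and composePipeline are hypothetical stand-ins invented for illustration, not the real tokenizer or tree builder; only the three-step structure comes from the list above.

```javascript
// Hypothetical stand-in stages. Each factory takes a config and returns a
// plain function, so every step can be tested on its own.
function makeTokenizer( config ) {
    return function ( wikitext ) {
        return wikitext.split( config.separator ).filter( function ( t ) {
            return t.length > 0;
        } );
    };
}

function makeTreeBuilder( config ) {
    return function ( tokens ) {
        return {
            type: config.rootType,
            children: tokens.map( function ( t ) {
                return { type: 'text', value: t };
            } )
        };
    };
}

function makeSerializer( config ) {
    return function ( tree ) {
        return JSON.stringify( tree, null, config.indent );
    };
}

// Compose the three initialized stages into a single in-out function.
function composePipeline( tokenize, buildTree, serialize ) {
    return function ( wikitext ) {
        return serialize( buildTree( tokenize( wikitext ) ) );
    };
}

var wikiTextToJson = composePipeline(
    makeTokenizer( { separator: ' ' } ),
    makeTreeBuilder( { rootType: 'document' } ),
    makeSerializer( { indent: 0 } )
);
var output = wikiTextToJson( 'hello wiki world' );
```

Because each stage is an ordinary function of its input, a test can call makeTokenizer( ... ) directly without building the rest of the chain.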
Just a query on what interfaces people would like:
I'm assuming exceptions are not a good idea, due to Node's async nature and there are certain constructs where we are explicitly async -- tokenizing can be streamed, and I assume when we start doing lookups to figure out what to do with templates we'll need to go async for seconds at a time.
I'm also assuming that 99.99% of the time we want a simple in-out interface as described above. But for testing and debugging, we want to instrument what's really going on. And we may want to pass control off for a while when we bring template parsing into the mix. So that means that either there are magic values, or there's some way to attach event listeners to the serializer? Is it okay to attach event listeners to the serializer without tying them to a specific pipeline of wikitext that's finding its way through the code?
On 12/28/2011 05:45 AM, Neil Kandalgaonkar wrote:
> I pulled out most of the parser-y parts from the parserTests, leaving behind just tests.
Very good, this was really needed.
> However, the parser is still a bit of a monster object, hence the deliberately silly name, ParserThingy.
> I'm trying to decompose it into a chain, roughly like:
The current implementation already operates as a chain, as documented in https://www.mediawiki.org/wiki/Future/Parser_development:
PEG wiki/HTML tokenizer         (or other tokenizers / SAX-like parsers)
| Chunks of tokens
V
Token stream transformations
| Chunks of tokens
V
HTML5 tree builder
| HTML 5 DOM tree
V
DOM Postprocessors
| HTML5 DOM tree
+------------------> (X)HTML serialization
|
V
DomConverter
| WikiDom
V
JSON serialization
| JSON string
V
Visual Editor
The token stream transformation phase and, to some degree, the DOM postprocessor phase will soon differ in their configuration depending on the intended output format, enabled extensions and other wiki-specific settings. Output intended for viewing will have templates fully expanded and more aggressive sanitization applied in DOM postprocessors. Output destined for the editor will have templates and extension tag results encapsulated. At least, that is the plan so far; we might come up with better ways to handle this later.
The interface between the tokenizer, token stream transforms and the tree builder wrapper is currently synchronous with a single list of tokens being passed from one phase to the next. This should be changed to event emitters that emit chunks of tokens. The tree builder wrapper already implements the event emitter pattern to internally communicate with the HTML5 tree builder library.
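A rough sketch of what the event-based interface between two phases might look like. The class names (ChunkTokenizer, UpperCaseTransform) and the event names 'chunk' and 'end' are assumptions made up for this example; the real stages and event names may well differ.

```javascript
var EventEmitter = require( 'events' ).EventEmitter;
var util = require( 'util' );

// Hypothetical tokenizer stage: emits 'chunk' events carrying arrays of
// tokens, followed by a single 'end' event.
function ChunkTokenizer( chunkSize ) {
    EventEmitter.call( this );
    this.chunkSize = chunkSize;
}
util.inherits( ChunkTokenizer, EventEmitter );

ChunkTokenizer.prototype.tokenize = function ( text ) {
    var tokens = text.split( /\s+/ ).filter( Boolean );
    for ( var i = 0; i < tokens.length; i += this.chunkSize ) {
        this.emit( 'chunk', tokens.slice( i, i + this.chunkSize ) );
    }
    this.emit( 'end' );
};

// Hypothetical transform stage: listens to an upstream emitter and
// re-emits each chunk after transforming it.
function UpperCaseTransform( upstream ) {
    EventEmitter.call( this );
    var self = this;
    upstream.on( 'chunk', function ( chunk ) {
        self.emit( 'chunk', chunk.map( function ( t ) {
            return t.toUpperCase();
        } ) );
    } );
    upstream.on( 'end', function () {
        self.emit( 'end' );
    } );
}
util.inherits( UpperCaseTransform, EventEmitter );

// Wire the stages together and collect what comes out of the last one.
var tokenizer = new ChunkTokenizer( 2 );
var transform = new UpperCaseTransform( tokenizer );
var received = [];
var ended = false;
transform.on( 'chunk', function ( chunk ) { received.push( chunk ); } );
transform.on( 'end', function () { ended = true; } );
tokenizer.tokenize( 'a b c d e' );
```

Each stage only knows about the emitter directly upstream of it, so a test can feed a stage hand-written chunks from a bare EventEmitter.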
The tree builder consumes token events until the end token is reached. The FauxHTML5.TreeBuilder wrapper could be extended to emit an additional signal once the end token has been processed, so that DOM postprocessing, WikiDom conversion and JSON serialization can be triggered. All DOM-based processing is essentially synchronous and does not perform any IO, so these stages can all be called from a single function for now. This stage should in turn be an event emitter, so that you can register for further asynchronous processing of the result.
After the conversion to EventEmitters, the wrapper object (the ParserThingy you just created) still configures the stages in a particular way, and registers the stages as event listeners with each other. The size of the wrapper can eventually be reduced a bit by pushing more of the phase-specific setup into the phase constructors and setup functions themselves. The high degree of decomposition into phases that is already there means that a few lines of setup per phase will still add up to a 'monster object' of a few dozen lines. A reasonable price to pay for independent testing, potential parallel execution of stages and modularity, IMHO.
Finally, the wrapper will start the pipeline by calling the tokenizer. No result will be returned, but a callback is called or an event emitted when the pipeline is done.
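Roughly, such a wrapper could look like the sketch below. ParserPipeline, its 'token' and 'document' events and the END token type are all hypothetical names chosen for the example, and the stages are trivial stand-ins.

```javascript
var EventEmitter = require( 'events' ).EventEmitter;

// Hypothetical wrapper: wires the stages together as event listeners and
// reports completion through a callback instead of a return value.
function ParserPipeline() {
    var self = this;
    this.tokenizer = new EventEmitter();
    this.treeBuilder = new EventEmitter();
    this.pending = null; // callback of the parse in progress
    this.nodes = [];

    // Tree builder stand-in: consume tokens until the end token arrives,
    // then signal that the document is complete.
    this.tokenizer.on( 'token', function ( token ) {
        if ( token.type === 'END' ) {
            self.treeBuilder.emit( 'document', { children: self.nodes } );
        } else {
            self.nodes.push( token );
        }
    } );

    // The DOM-based stages are synchronous, so they run in one function.
    this.treeBuilder.on( 'document', function ( doc ) {
        var json = JSON.stringify( doc );
        var callback = self.pending;
        // Clean up before handing over the result, so the pipeline can be reused.
        self.pending = null;
        self.nodes = [];
        callback( json );
    } );
}

// parse() returns nothing; the result is delivered to the callback.
ParserPipeline.prototype.parse = function ( text, callback ) {
    var tokenizer = this.tokenizer;
    this.pending = callback;
    text.split( ' ' ).forEach( function ( word ) {
        tokenizer.emit( 'token', { type: 'TEXT', value: word } );
    } );
    tokenizer.emit( 'token', { type: 'END' } );
};

var pipeline = new ParserPipeline();
var result = null;
var returned = pipeline.parse( 'foo bar', function ( json ) {
    result = json;
} );
```

In this toy version everything happens synchronously, so the callback has already fired by the time parse() returns; with real IO in the token transforms it would fire later, which is why the caller should not rely on a return value.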
> I'm assuming exceptions are not a good idea, due to Node's async nature and there are certain constructs where we are explicitly async -- tokenizing can be streamed, and I assume when we start doing lookups to figure out what to do with templates we'll need to go async for seconds at a time.
Error reporting will have to happen in-band in the form of specific tokens or DOM nodes with specific attributes that allow the editor or browser to render some error message. We should decide on an encapsulation for these that makes it easy to render or otherwise handle them generically. Exceptions should only be thrown for fatal bugs, but not network failures or similar.
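One possible encapsulation, purely as a sketch: the ERROR token type, its fields and the rendered markup are assumptions invented here, since the actual format is still to be decided.

```javascript
// Hypothetical error encapsulation: an ordinary token with a reserved type.
function makeErrorToken( source, message ) {
    return { type: 'ERROR', source: source, message: message };
}

// Stand-in for a template lookup that can fail, e.g. because of a network
// failure. The templates object plays the role of the remote store.
function expandTemplate( name, templates ) {
    if ( Object.prototype.hasOwnProperty.call( templates, name ) ) {
        return { type: 'TEXT', value: templates[ name ] };
    }
    // Recoverable failure: report it in-band instead of throwing.
    return makeErrorToken( 'template:' + name, 'Template could not be fetched' );
}

// Generic rendering: a consumer only needs to know the ERROR token shape.
// HTML escaping is omitted to keep the sketch short.
function renderToken( token ) {
    if ( token.type === 'ERROR' ) {
        return '<span class="error" data-source="' + token.source + '">' +
            token.message + '</span>';
    }
    return token.value;
}

var templates = { greeting: 'Hello' };
var rendered = [ 'greeting', 'missing' ].map( function ( name ) {
    return renderToken( expandTemplate( name, templates ) );
} ).join( ' ' );
```

The failed lookup flows through the same path as a successful one, so later stages and the editor can handle it without any try/catch around asynchronous code.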
> I'm also assuming that 99.99% of the time we want a simple in-out interface as described above. But for testing and debugging, we want to instrument what's really going on. And we may want to pass control off for a while when we bring template parsing into the mix. So that means that either there are magic values, or there's some way to attach event listeners to the serializer?
Converting the pipeline to communicate using events is really sufficient. Apart from interface definitions for the representation of errors, tokens, etc., no magic values are involved. Note that the parse() function of a simplified wrapper will also require a callback to receive the result, or be an EventEmitter itself to support asynchronous processing.
> Is it okay to attach event listeners to the serializer without tying them to a specific pipeline of wikitext that's finding its way through the code?
Depends on what you are trying to do. Reusing a parser pipeline for multiple parses will be fine (after adding implicit clean-ups for the tree builder phase). Your event receiver or callback will have to know what to do with the results from different parses though.
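As an illustration of that last point, one option is to tag each result with an identifier for the parse it belongs to. The 'done' event, the id field and the token-counting stand-in below are assumptions for the sake of the example, not an existing interface.

```javascript
var EventEmitter = require( 'events' ).EventEmitter;

// Hypothetical reusable pipeline: one emitter shared by all parses. Every
// 'done' event carries the id of the parse that produced it.
var sharedPipeline = new EventEmitter();
var nextId = 0;

function parse( text ) {
    var id = nextId++;
    // Stand-in for the real stages: just count whitespace-separated tokens.
    var tokenCount = text.split( /\s+/ ).filter( Boolean ).length;
    sharedPipeline.emit( 'done', { id: id, tokenCount: tokenCount } );
    return id;
}

// A single listener, registered once, serves every parse; it uses the id
// to keep results from different parses apart.
var resultsById = {};
sharedPipeline.on( 'done', function ( result ) {
    resultsById[ result.id ] = result.tokenCount;
} );

var first = parse( 'one two three' );
var second = parse( 'four five' );
```

Without some such identifier, a listener attached to a shared stage has no way to tell which input a given result came from.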
Gabriel
Sorry, I didn't mean to imply that this division was my idea or anything. The phases of parsing are explicit already. By 'monster' object I don't mean that it is large or incomprehensible, but that it has a few too many responsibilities to be easy to test.
For instance, right now it's returning its output as a property of itself, and the serializer is sort of added on later. The pipeline should be a bit clearer and more stateless.
Anyway this is easily fixed, and will be soon...
On 12/28/11 4:35 AM, Gabriel Wicke wrote:
Wikitext-l mailing list Wikitext-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitext-l
> By 'monster' object I don't mean that it is large or incomprehensible, but that it has a few too many responsibilities to be easy to test.
Yeah, there is definitely a bit of cruft left that should disappear after eventification. I'll convert the TokenTransformDispatcher to event listener / emitter while refactoring it. That will remove the big callback that obscures the pipeline flow a bit.
> For instance, right now it's returning its output as a property of itself, and the serializer is sort of added on later. The pipeline should be a bit clearer and more stateless.
> Anyway this is easily fixed, and will be soon...
Awesome!
Gabriel