On 12/28/2011 05:45 AM, Neil Kandalgaonkar wrote:
> I pulled out most of the parser-y parts from the parserTests, leaving
> behind just tests.
Very good, this was really needed.
> However, the parser is still a bit of a monster object, hence the
> deliberately silly name, ParserThingy.
>
> I'm trying to decompose it into a chain, roughly like:
The current implementation already operates as a chain, as documented in
https://www.mediawiki.org/wiki/Future/Parser_development:
PEG wiki/HTML tokenizer (or other tokenizers / SAX-like parsers)
| Chunks of tokens
V
Token stream transformations
| Chunks of tokens
V
HTML5 tree builder
| HTML 5 DOM tree
V
DOM Postprocessors
| HTML5 DOM tree
+------------------> (X)HTML serialization
|
V
DomConverter
| WikiDom
V
JSON serialization
| JSON string
V
Visual Editor
The token stream transformation phase, and to some degree the DOM
postprocessor phase, will soon differ in their configuration depending on
the intended output format, enabled extensions and other wiki-specific
settings. Output intended for viewing will have templates fully expanded
and more aggressive sanitization applied in DOM postprocessors. Output
destined for the editor will have templates and extension tag results
encapsulated. At least, that is the plan so far; we might come up with
better ways to handle this later.
The interface between the tokenizer, token stream transforms and the
tree builder wrapper is currently synchronous with a single list of
tokens being passed from one phase to the next. This should be changed
to event emitters that emit chunks of tokens. The tree builder wrapper
already implements the event emitter pattern to internally communicate
with the HTML5 tree builder library.
The tree builder consumes token events until the end token is reached.
The FauxHTML5.TreeBuilder wrapper could be extended to emit an
additional signal when the end token is processed, so that DOM
postprocessing, WikiDom conversion and JSON serialization can be
triggered. All DOM-based processing is essentially synchronous and does
not perform any IO, so these stages can all be called from a single
function for now. This stage should in turn be an event emitter, so that
you can register for further asynchronous processing of the result.
After the conversion to EventEmitters, the wrapper object (the
ParserThingy you just created) will still configure the stages in a
particular way, and register the stages as event listeners with each
other. The size of the wrapper can eventually be reduced a bit by
pushing more of the phase-specific setup into the phase constructors and
setup functions themselves. Given the high degree of decomposition into
phases, however, even a few lines of setup per phase will add up to a
'monster object' of a few dozen lines. A reasonable price to pay for
independent testing, potential parallel execution of stages and
modularity, IMHO.
Finally, the wrapper will start the pipeline by calling the tokenizer.
No result will be returned; instead, a callback will be called or an
event emitted when the pipeline is done.
> I'm assuming exceptions are not a good idea, due to Node's async nature
> and there are certain constructs where we are explicitly async --
> tokenizing can be streamed, and I assume when we start doing lookups to
> figure out what to do with templates we'll need to go async for seconds
> at a time.
Error reporting will have to happen in-band, in the form of specific
tokens or DOM nodes with specific attributes that allow the editor or
browser to render some error message. We should decide on an
encapsulation for these that makes it easy to render or otherwise handle
them generically. Exceptions should only be thrown for fatal bugs, not
for network failures or similar.
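One possible encapsulation, purely as a sketch: a marker node with a
well-known type carrying the error message in attributes, so that a generic
renderer only has to check the type, never the cause. The type name and
attribute names here are assumptions, not a settled interface:

```javascript
// Hypothetical in-band error node: rendered generically by type.
function makeErrorNode(message, origin) {
  return {
    type: 'mw:Error', // invented marker type
    attrs: { 'data-error': message, 'data-origin': origin },
  };
}

// A generic renderer needs no knowledge of what went wrong.
function renderNode(node) {
  if (node.type === 'mw:Error') {
    return `<span class="error">${node.attrs['data-error']}</span>`;
  }
  return String(node.value);
}

// Usage: an error node flows through the pipeline like any other node.
const nodes = [
  { type: 'text', value: 'ok' },
  makeErrorNode('template fetch failed', 'template-expansion'),
];
const html = nodes.map(renderNode).join('');
```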
> I'm also assuming that 99.99% of the time we want a simple in-out
> interface as described above. But for testing and debugging, we want to
> instrument what's really going on. And we may want to pass control off
> for a while when we bring template parsing into the mix. So that means
> that either there are magic values, or there's some way to attach event
> listeners to the serializer?
Converting the pipeline to communicate using events is really all that
is needed. Apart from interface definitions covering the representation
of errors, tokens etc., no magic values are involved. Note that the
parse() function of a simplified wrapper will also require a callback to
receive the result, or be an EventEmitter itself to support asynchronous
processing.
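A minimal sketch of that simplified parse() interface: no return value, the
result delivered through a Node-style callback. The function name and
document shape are assumptions for illustration, and a real pipeline would
invoke the callback asynchronously (e.g. after template fetches):

```javascript
// Hypothetical simplified wrapper entry point with a result callback.
function parse(wikitext, callback) {
  let error = null;
  let doc = null;
  try {
    // Stand-in for the tokenize -> transform -> build -> convert chain.
    doc = { type: 'doc', text: wikitext.trim() };
  } catch (e) {
    error = e;
  }
  // Node-style callback: error first, result second.
  callback(error, doc);
}

// Usage: the caller receives the document when the pipeline finishes.
let result = null;
parse("''hello''  ", (err, doc) => {
  if (!err) { result = doc; }
});
```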
> Is it okay to attach event listeners to the serializer without tying
> them to a specific pipeline of wikitext that's finding its way through
> the code?
Depends on what you are trying to do. Reusing a parser pipeline for
multiple parses will be fine (after adding implicit clean-ups for the
tree builder phase). Your event receiver or callback will have to know
what to do with the results from different parses though.
Gabriel