On 04/02/2012 07:27 PM, Ori Livneh wrote:
Awesome -- thanks!
I have a couple of questions to ask, too. These aren't urgent, so feel free to reply whenever you have the time.
- Conformance vs. enhancement
I'm not sure if you saw my comment in the diff, but some of the ISBN validation code I wrote is deliberately unreachable because it's stricter than the naive validation the Mediawiki currently performs. (I stupidly wrote it before checking to see what exactly the current parser does.) In general, is it ever desirable to try and improve on the old parsing grammar or is total conformance the overriding goal?
Breaking commonly-used wikitext constructs would need a very good justification, while enhancements / clean-ups for little-used edge cases are a good idea. I created a dumpGrepper tool (see VisualEditor/tests/parser/dumpGrepper.js) to check how common some constructs are.
For ISBN in particular, the following bug discussed the trade-offs involved in stricter validation: https://bugzilla.wikimedia.org/show_bug.cgi?id=2391 There is now strict validation and a warning in Special:Booksources, but the parser still recognizes invalid ISBNs.
- ECMAScript target
Should I strive to write code that is compatible with older browsers? Thus far I've avoided ES 1.5 constructs like forEach, map, filter, etc. But if this is only going to run server-side under node for the foreseeable future, maybe that's an unnecessary handicap.
We don't have concrete plans to run this code on old clients, and by the time that would happen most browsers might be recent enough. Performance is important though, and simple constructs like for / while loops were still faster on V8 last time I checked.
- Intermediate representation vs. parser hacks
The preliminary work I've done to parse behavior switches (__TOC__ & friends) has the parser directly toggle attributes on a configuration object. It does this rather than produce a token for some subsequent tree visitor to interpret. So the parsing is arguably somewhat lossy in this case, but possibly not: when serializing back wikitext, you could read all behaviors from the configuration object and encode them at the top; their position isn't significant, afaik.)
The position of the switch is still significant to avoid dirty diffs. Adam Wight actually implemented a bit of the back-end infrastructure for this in https://gerrit.wikimedia.org/r/#change,4050. We should be able to combine the two patches by joining your tokenizer work with Adam's token transformer work.
So: did I make the right decision? PEG.js's support for arbitrary JavaScript code in the grammar makes this tempting to do sometimes, but perhaps it's wiser to insist that the parser not overreach the goal of building a tree. What is your take?
The tokenizer should try hard to produce something round-trippable while still trying to perform most context-free parsing if it can. It does build a parse tree in the process, but then immediately flattens it by emitting a stream of tokens. In this case we should emit tokens for the behavior switches, which are then handled in a token stream transformer (see Adam's patch).
I plan to port the wikitext serializer in the editor to operate on tokens instead, so that we can do round-trip testing at different levels in the processing chain. Serializing a DOM subtree is then equivalent to walking the tree and calling the start / end token serializers for each DOM node. Behavior switches thus need to be represented as (invisible) nodes in the DOM as well to support round-tripping via the DOM. HTML5 Microdata would suggest the meta element for this, while I am not 100% sure which would be the preferred element for RDFa. Perhaps a span without text content. The mapping from internal token to meta / span can be performed in the token stream transformer. We should standardize on a single way to mark up invisible content to simplify serialization.
Gabriel