On 04/02/2012 07:27 PM, Ori Livneh wrote:
> Awesome -- thanks!
>
> I have a couple of questions to ask, too. These aren't urgent, so feel
> free to reply whenever you have the time.
> 1) Conformance vs. enhancement
>
> I'm not sure if you saw my comment in the diff, but some of the ISBN
> validation code I wrote is deliberately unreachable because it's
> stricter than the naive validation MediaWiki currently performs. (I
> stupidly wrote it before checking what exactly the current parser
> does.) In general, is it ever desirable to try to improve on the old
> parsing grammar, or is total conformance the overriding goal?
Breaking commonly-used wikitext constructs would need a very good
justification, while enhancements / clean-ups for little-used edge cases
are a good idea. I created a dumpGrepper tool (see
VisualEditor/tests/parser/dumpGrepper.js) to check how common some
constructs are.
For ISBN in particular, the following bug discussed the trade-offs
involved in stricter validation:
https://bugzilla.wikimedia.org/show_bug.cgi?id=2391
There is now strict validation and a warning in Special:Booksources, but
the parser still recognizes invalid ISBNs.
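To make the difference concrete, here is a rough sketch (hypothetical helper names, not the actual MediaWiki or Special:Booksources code) of naive recognition versus check-digit validation:

    // Roughly what "naive" recognition might accept: something that
    // merely looks like an ISBN after the "ISBN " prefix.
    function looksLikeIsbn( str ) {
        return /^(97[89][- ]?)?([0-9][- ]?){9}[0-9Xx]$/.test( str );
    }

    // Stricter validation: also verify the check digit.
    function isValidIsbn( str ) {
        var i, sum = 0,
            digits = str.replace( /[- ]/g, '' ).toUpperCase();
        if ( /^[0-9]{9}[0-9X]$/.test( digits ) ) {
            // ISBN-10: weighted sum mod 11 must be 0; 'X' stands for 10.
            for ( i = 0; i < 10; i++ ) {
                sum += ( digits[i] === 'X' ? 10 : +digits[i] ) * ( 10 - i );
            }
            return sum % 11 === 0;
        }
        if ( /^[0-9]{13}$/.test( digits ) ) {
            // ISBN-13: alternating 1/3 weights, sum mod 10 must be 0.
            for ( i = 0; i < 13; i++ ) {
                sum += +digits[i] * ( i % 2 === 0 ? 1 : 3 );
            }
            return sum % 10 === 0;
        }
        return false;
    }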
> 2) ECMAScript target
>
> Should I strive to write code that is compatible with older browsers?
> Thus far I've avoided the ES5 array methods (forEach, map, filter,
> etc.). But if this is only going to run server-side under node for the
> foreseeable future, maybe that's an unnecessary handicap.
We don't have concrete plans to run this code on old clients, and by the
time that would happen most browsers might be recent enough. Performance
is important, though, and simple constructs like for / while loops were
still faster on V8 the last time I checked.
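For illustration, the trade-off is roughly the following (a contrived example, not code from either patch):

    var i, switches = [],
        tokens = [ 'foo', '__TOC__', 'bar' ];

    // ES5 array extra: concise, but one callback call per element.
    var switches5 = tokens.filter( function ( t ) {
        return /^__[A-Z]+__$/.test( t );
    } );

    // Plain loop: works in pre-ES5 engines and was faster on V8 at the
    // time, at the cost of some verbosity.
    for ( i = 0; i < tokens.length; i++ ) {
        if ( /^__[A-Z]+__$/.test( tokens[i] ) ) {
            switches.push( tokens[i] );
        }
    }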
> 3) Intermediate representation vs. parser hacks
>
> The preliminary work I've done to parse behavior switches (__TOC__ &
> friends) has the parser directly toggle attributes on a configuration
> object, rather than produce a token for some subsequent tree visitor to
> interpret. So the parsing is arguably somewhat lossy in this case, but
> possibly not: when serializing back to wikitext, you could read all the
> behaviors from the configuration object and encode them at the top;
> their position isn't significant, afaik.
The position of the switch is still significant to avoid dirty diffs.
Adam Wight actually implemented a bit of the back-end infrastructure for
this in
https://gerrit.wikimedia.org/r/#change,4050. We should be able
to combine the two patches by joining your tokenizer work with Adam's
token transformer work.
> So: did I make the right decision? PEG.js's support for arbitrary
> JavaScript code in the grammar makes this tempting to do sometimes, but
> perhaps it's wiser to insist that the parser not overreach the goal of
> building a tree. What is your take?
The tokenizer should try hard to produce something round-trippable while
still performing as much context-free parsing as it can. It does build a
parse tree in the process, but then immediately flattens it by emitting a
stream of tokens. In this case we should emit tokens for the behavior
switches, which are then handled in a token stream transformer (see
Adam's patch).
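A very rough sketch of that division of labour (all names here are hypothetical, not the ones in Adam's patch):

    // Tokenizer action for '__TOC__' & friends: emit a dedicated token
    // instead of toggling a flag on a configuration object.
    function behaviorSwitchToken( name ) {
        return { type: 'BEHAVIOR-SWITCH', name: name }; // e.g. name === 'TOC'
    }

    // Token stream transformer registered for that token type: record the
    // switch for the renderer, and keep an invisible placeholder token in
    // the stream so the original position survives for round-tripping.
    function onBehaviorSwitch( token, env ) {
        env.pageProperties[ token.name ] = true;
        return [ { type: 'META', attribs: { 'data-mw-switch': token.name } } ];
    }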
I plan to port the wikitext serializer in the editor to operate on
tokens instead, so that we can do round-trip testing at different levels
in the processing chain. Serializing a DOM subtree is then equivalent to
walking the tree and calling the start / end token serializers for each
DOM node. Behavior switches thus need to be represented as (invisible)
nodes in the DOM as well to support round-tripping via the DOM. HTML5
Microdata would suggest the meta element for this, though I am not 100%
sure which element would be preferred for RDFa. Perhaps a span
without text content. The mapping from internal token to meta / span can
be performed in the token stream transformer. We should standardize on a
single way to mark up invisible content to simplify serialization.
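For __TOC__, the round trip could then look roughly like this (the attribute name is a placeholder, still to be standardized):

    // wikitext:  __TOC__
    // DOM:       <meta data-mw-switch="TOC"> -- an invisible placeholder node
    //
    // The per-node serializer for such a placeholder just emits the
    // original switch again:
    function serializeBehaviorSwitch( node ) {
        return '__' + node.getAttribute( 'data-mw-switch' ) + '__';
    }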
Gabriel