On Sat, Jul 9, 2011 at 2:54 PM, Jay Ashworth <jra@baylink.com> wrote:
> How good is good enough?
>
> How many pages is a replacement parser allowed to break, and still be
> certified?
>
> That is: what is the *real* spec for mediawikitext?  If we say "the formal
> grammar", then we are *guaranteed* to break some articles.  That's the
> "Right Answer", from up here at 40,000 feet, where I watch from (having
> the luxury of not being responsible in any way for any of this :-), but
> it will involve breaking some eggs.
>
> I bring this back up because, the last time we had this conversation, the
> answer was "nope; the new parser will have to be bug-for-bug compatible
> with the current one".  Or something pretty close to that.

Officially speaking, the spec/new implementation will "win".

The theory is to have a defined, reimplementable, more extensible version of pretty much what we have now, so we can always read and work with the data that we've already got.

It will have to be reasonably close to the old behavior, but we *know* it won't be exact. "Reasonable" will, ultimately, be fairly arbitrary, and will be determined based on our progress and findings some months down the road.

Running big batch comparisons to surface visible or semantic differences between the old & new parsers will be an ongoing part of that, so we'll probably have some pretty numbers or graphs to look at as we get closer to something that can be dropped in.
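
A rough sketch of what one of those batch comparisons might look like; renderWithOldParser, renderWithNewParser, and normalizeHtml are hypothetical stand-ins here, since neither parser interface is pinned down yet:

    // Hypothetical hooks: renderWithOldParser() and renderWithNewParser()
    // stand in for whatever interfaces the two parsers end up exposing.
    // normalizeHtml() is a placeholder for the (nontrivial) normalization step.

    function normalizeHtml(html) {
        // Collapse whitespace as a crude first pass; a real comparison would
        // also need to ignore insignificant markup/attribute differences.
        return html.replace(/\s+/g, ' ').trim();
    }

    function comparePage(wikitext, renderWithOldParser, renderWithNewParser) {
        var oldHtml = normalizeHtml(renderWithOldParser(wikitext));
        var newHtml = normalizeHtml(renderWithNewParser(wikitext));
        return oldHtml === newHtml;
    }

    // Tally match rates over a batch of pages -- this is where the
    // "pretty numbers" would come from.
    function summarize(pages, renderWithOldParser, renderWithNewParser) {
        var matching = 0;
        pages.forEach(function (page) {
            if (comparePage(page.wikitext, renderWithOldParser, renderWithNewParser)) {
                matching++;
            }
        });
        console.log(matching + ' of ' + pages.length + ' pages render identically');
    }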

> I just think this is a question -- and answer -- that people should be
> slowly internalizing as we proceed down this path.
>
> 1) Formal Spec
> 2) Multiple Implementations
> 3) Test Suite
>
> I don't think it's completely unreasonable that we might have a way to
> grind articles against the current parser, and each new parser, and diff
> the output.  Something like that's the only way I can see that we *will*
> be able to tell how close new parsers come, and on which constructs they
> break (not that this means that I think The Wikipedia Corpus constitutes
> a valid Test Suite :-).

Definitely. :)

Output comparisons can be a tricky business, but they'll be an important component. I've started on a CLI batch testing framework for the JavaScript parser class (using node.js; in ParserPlayground/tests) that can read through a Wikipedia XML dump and run round-tripping checks; moving on to generating HTML output and comparing it against the current MediaWiki parser's output will be very valuable (though doing comparisons on HTML is tricky!).
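
For reference, a very rough sketch of the shape of that round-tripping check; parseToAst and serializeAst are hypothetical stand-ins for the ParserPlayground parser/serializer, and the naive <text> scan is only meant for a small dump extract:

    var fs = require('fs');

    // Does wikitext survive a parse -> serialize round trip unchanged?
    function roundTrips(wikitext, parseToAst, serializeAst) {
        return serializeAst(parseToAst(wikitext)) === wikitext;
    }

    function checkDump(dumpPath, parseToAst, serializeAst) {
        // Naive scan for <text> elements; fine for a small dump extract,
        // but a full dump needs a streaming XML parser, and the contents
        // would also need XML entity unescaping before being parsed.
        var xml = fs.readFileSync(dumpPath, 'utf8');
        var fragments = xml.match(/<text[^>]*>[\s\S]*?<\/text>/g) || [];
        var failures = 0;
        fragments.forEach(function (fragment) {
            var wikitext = fragment
                .replace(/^<text[^>]*>/, '')
                .replace(/<\/text>$/, '');
            if (!roundTrips(wikitext, parseToAst, serializeAst)) {
                failures++;
            }
        });
        console.log(failures + ' of ' + fragments.length + ' pages failed to round-trip');
    }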

-- brion