On Sat, Jul 9, 2011 at 2:54 PM, Jay Ashworth <jra@baylink.com> wrote:
How good is good enough?
How many pages is a replacement parser allowed to break, and still be
certified?
That is: what is the *real* spec for mediawikitext? If we say "the formal
grammar", then we are *guaranteed* to break some articles. That's the
"Right Answer", from up here at 40,000 feet, where I watch from (having
the luxury of not being responsible in any way for any of this :-), but
it will involve breaking some eggs.
I bring this back up because, the last time we had this conversation, the
answer was "nope; the new parser will have to be bug-for-bug compatible
with the current one". Or something pretty close to that.
I just think this is a question -- and answer -- that people should be
slowly internalizing as we proceed down this path.
1) Formal Spec
2) Multiple Implementations
3) Test Suite
I don't think it's completely unreasonable that we might have a way to
grind articles against the current parser, and each new parser, and diff
the output. Something like that is the only way I can see that we *will*
be able to tell how close new parsers come, and on which constructs they
break (not that this means that I think The Wikipedia Corpus constitutes
a valid Test Suite :-).
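
A minimal sketch of that grind-and-diff harness, in Python. The two parser
functions here are stubs standing in for the real current parser and a
candidate replacement (in practice you'd shell out to each one); the point is
just the shape of the loop: render every article with both, diff, and collect
the articles where they disagree.

```python
import difflib

# Hypothetical stand-ins for the parsers under comparison. In practice these
# would invoke the current PHP parser and the candidate parser externally.
def current_parser(wikitext):
    # Stub: pretend the reference parser renders one '''bold''' span.
    return wikitext.replace("'''", "<b>", 1).replace("'''", "</b>", 1)

def new_parser(wikitext):
    # Stub: a candidate parser that (incorrectly) leaves bold markup alone.
    return wikitext

def diff_outputs(articles, parse_a, parse_b):
    """Render each article with both parsers; return a diff per disagreement."""
    breakages = {}
    for title, text in articles.items():
        a, b = parse_a(text), parse_b(text)
        if a != b:
            breakages[title] = "\n".join(difflib.unified_diff(
                a.splitlines(), b.splitlines(),
                fromfile="current", tofile="new", lineterm=""))
    return breakages

articles = {
    "Bold": "'''Hello''' world",
    "Plain": "Hello world",
}
report = diff_outputs(articles, current_parser, new_parser)
print(sorted(report))  # titles where the two parsers disagree
```

Run over a dump of the corpus, the keys of `report` are exactly the pages the
new parser would break, and the diffs show which constructs it breaks on.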