[Wikitext-l] "When Is A New Parser Done?" ( was: Re: [Wikitech-l] Parser bugs and their priority)

Brion Vibber bvibber at wikimedia.org
Sun Jul 17 17:16:07 UTC 2011


On Sat, Jul 9, 2011 at 2:54 PM, Jay Ashworth <jra at baylink.com> wrote:

> How good is good enough?
>
> How many pages is a replacement parser allowed to break, and still be
> certified?
>
> That is: what is the *real* spec for mediawikitext?  If we say "the formal
> grammar", then we are *guaranteed* to break some articles.  That's the
> "Right Answer", from up here at 40,000 feet, where I watch from (having
> the luxury of not being responsible in any way for any of this :-), but
> it will involve breaking some eggs.
>
> I bring this back up because, the last time we had this conversation, the
> answer was "nope; the new parser will have to be bug-for-bug compatible
> with the current one".  Or something pretty close to that.
>

Officially speaking, the spec/new implementation will "win".

The idea is to have a defined, reimplementable, more extensible version of
pretty much what we have now, so we can always read and work with the data
we've already got.

It will have to be reasonably close to the old behavior, but we *know* it
won't be exact. "Reasonable" will, ultimately, be fairly arbitrary, and will
be determined based on our progress and findings some months down the road.

Running big batch comparisons to determine visible or semantic differences
between the old & new parsers will be an ongoing part of that, so we'll
probably have some pretty numbers or graphs to look at as we get closer to
something that can be dropped in.
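
For a concrete flavor of what that comparison harness might look like, here's
a rough node.js-style sketch; renderWithOldParser / renderWithNewParser are
just stand-ins for whatever rendering entry points we end up wiring up, not
real API names:

    // Sketch only: run a batch of pages through both parsers, normalize the
    // HTML enough to ignore trivial whitespace noise, and count mismatches.
    function normalizeHtml(html) {
        return html.replace(/\s+/g, ' ').trim();
    }

    // renderWithOldParser / renderWithNewParser are hypothetical callbacks
    // wrapping the existing PHP parser (e.g. via the API) and the new one.
    function comparePages(pages, renderWithOldParser, renderWithNewParser) {
        var differing = [];
        pages.forEach(function (page) {
            var oldHtml = normalizeHtml(renderWithOldParser(page.wikitext));
            var newHtml = normalizeHtml(renderWithNewParser(page.wikitext));
            if (oldHtml !== newHtml) {
                differing.push(page.title);
            }
        });
        console.log(differing.length + ' of ' + pages.length + ' pages differ');
        return differing;
    }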

> I just think this is a question -- and answer -- that people should be
> slowly internalizing as we proceed down this path.
>
> 1) Formal Spec
> 2) Multiple Implementations
> 3) Test Suite
>
> I don't think it's completely unreasonable that we might have a way to
> grind articles against the current parser, and each new parser, and diff
> the output.  Something like that's the only way I can see that we *will*
> be able to tell how close new parsers come, and on which constructs they
> break (not that this means that I think The Wikipedia Corpus constitutes
> a valid Test Suite :-).
>

Definitely. :)

Output comparisons can be a tricky business, but they'll be an important
component. I've started on a CLI batch testing framework for the JavaScript
parser class (using node.js; in ParserPlayground/tests) that can read
through a Wikipedia XML dump and run round-tripping checks; moving on to
rendering HTML output and comparing it against output from the current
MediaWiki parser will be very valuable (though doing comparisons on HTML is
tricky!)
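
The round-trip check itself is conceptually simple; roughly like this
(parseToTree / serializeTree here are stand-ins, not the actual
ParserPlayground method names):

    // Sketch: wikitext -> parse tree -> wikitext again, flagging any page
    // whose re-serialized source no longer matches the input exactly.
    function roundTripCheck(title, wikitext, parseToTree, serializeTree) {
        var tree = parseToTree(wikitext);
        var roundTripped = serializeTree(tree);
        if (roundTripped !== wikitext) {
            console.log('Round-trip mismatch: ' + title);
            return false;
        }
        return true;
    }

The HTML comparison side is harder, since the two parsers can legitimately
differ in whitespace, attribute order, and nesting while producing
identical-looking pages; some normalization pass like the one sketched above
will be needed before a diff means anything.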

-- brion