On Sat, Jul 9, 2011 at 2:54 PM, Jay Ashworth <jra@baylink.com> wrote:
How good is good enough?
How many pages is a replacement parser allowed to break, and still be certified?
That is: what is the *real* spec for mediawikitext? If we say "the formal grammar", then we are *guaranteed* to break some articles. That's the "Right Answer" from up here at 40,000 feet where I watch (having the luxury of not being responsible in any way for any of this :-), but it will involve breaking some eggs.
I bring this back up because, the last time we had this conversation, the answer was "nope; the new parser will have to be bug-for-bug compatible with the current one". Or something pretty close to that.
Officially speaking, the spec/new implementation will "win".
The theory is to have a defined, reimplementable, more extensible version of pretty much what we have now, so we can always read and work with the data that we've already got.
It will have to be reasonably close to the old behavior, but we *know* it won't be exact. "Reasonable" will, ultimately, be fairly arbitrary, and will be determined based on our progress and findings some months down the road.
Running big batch comparisons to determine visible or semantic differences between the old & new parsers will be an ongoing part of that, so we'll probably have some pretty numbers or graphs to look at as we get closer to something that can be dropped in.
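
(To make that concrete, the comparison step per page would be something along these lines. The parser objects and their render() method here are just placeholder names for illustration, not the real MediaWiki or ParserPlayground interfaces.)

  // Illustrative sketch only: oldParser/newParser and render() are
  // hypothetical stand-ins for whatever interface we end up exposing.
  function normalize(html) {
    // Collapse whitespace so purely cosmetic differences don't count
    // as regressions.
    return html.replace(/\s+/g, ' ').trim();
  }

  function comparePage(title, wikitext, oldParser, newParser) {
    var oldHtml = normalize(oldParser.render(wikitext));
    var newHtml = normalize(newParser.render(wikitext));
    return { title: title, identical: (oldHtml === newHtml) };
  }

Anything beyond trivial whitespace normalization (attribute ordering, tidy artifacts, etc.) is where the "tricky" part of HTML comparison comes in.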
I just think this is a question -- and answer -- that people should be slowly internalizing as we proceed down this path.
- Formal Spec
- Multiple Implementations
- Test Suite
I don't think it's completely unreasonable that we might have a way to grind articles against the current parser, and each new parser, and diff the output. Something like that is the only way I can see that we *will* be able to tell how close new parsers come, and on which constructs they break (not that this means I think The Wikipedia Corpus constitutes a valid Test Suite :-).
Definitely. :)
Output comparisons can be a tricky business, but they'll be an important component. I've started on a CLI batch testing framework for the JavaScript parser class (using node.js; in ParserPlayground/tests) that can read through a Wikipedia XML dump and run round-tripping checks; moving on to rendering HTML output and comparing it against output from the current MediaWiki parser will be very valuable (though doing comparisons on HTML is tricky!).
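
(Roughly, the per-page round-tripping check looks like the sketch below; parse() and serialize() are illustrative names rather than the actual ParserPlayground method names.)

  // Sketch of a per-page round-trip check: wikitext -> tree -> wikitext
  // should come back identical if the parser and serializer agree.
  function roundTripCheck(title, wikitext, parser) {
    var tree = parser.parse(wikitext);
    var out = parser.serialize(tree);
    if (out !== wikitext) {
      console.log('Round-trip mismatch in [[' + title + ']]');
      return false;
    }
    return true;
  }

The batch harness just runs something like that over every page in the dump and tallies up the mismatches.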
-- brion