On Sat, Jul 9, 2011 at 2:54 PM, Jay Ashworth <span dir="ltr">&lt;<a href="mailto:jra@baylink.com">jra@baylink.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

How good is good enough?<br>

<br>

How many pages is a replacement parser allowed to break, and still be<br>

certified?<br>

<br>

That is: what is the *real* spec for mediawikitext?  If we say &quot;the formal<br>

grammar&quot;, then we are *guaranteed* to break some articles.  That&#39;s the<br>

&quot;Right Answer&quot;, from up here at 40,000 feet, where I watch from (having<br>

the luxury of not being responsible in any way for any of this :-), but<br>

it will involve breaking some eggs.<br>

<br>

I bring this back up because, the last time we had this conversation, the<br>

answer was &quot;nope; the new parser will have to be bug-for-bug compatible<br>

with the current one&quot;.  Or something pretty close to that.<br></blockquote><div><br>Officially speaking, the spec/new implementation will &quot;win&quot;.<br><br>

The theory is to have a defined, reimplementable, more extensible version of pretty much 

what we have now, so we can always read and work with the data that we&#39;ve already got.<br>

<br>It will have to be reasonably close to the old behavior, but we *know* it won&#39;t be exact. &quot;Reasonable&quot; will, ultimately, be fairly arbitrary, and will be determined based on our progress and findings some months down the road.<br>

<br>Running big batch comparisons to detect visible or semantic differences between the old &amp; new parsers will be an ongoing part of that, so we&#39;ll probably have some pretty numbers or graphs to look at as we get closer to something that can be dropped in.<br>
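<br>A minimal sketch of what such a batch comparison could look like, in node.js-style JavaScript. Everything here is an assumption made for illustration: normalizeHtml, compareParsers, and the oldParse/newParse callbacks are hypothetical names, not part of MediaWiki or ParserPlayground, and the whitespace normalization is deliberately crude (dropping whitespace between tags can hide real differences around inline elements).<br>

```javascript
// Hypothetical sketch: none of these names are real MediaWiki APIs.

// Crude normalization so that purely cosmetic whitespace differences
// are not counted as parser mismatches.
function normalizeHtml(html) {
  return html
    .replace(/\s+/g, ' ')      // collapse runs of whitespace
    .replace(/>\s+</g, '><')   // drop whitespace between adjacent tags
    .trim();
}

// pages: array of { title, text }.
// oldParse / newParse: functions taking wikitext and returning an HTML string.
function compareParsers(pages, oldParse, newParse) {
  const mismatches = [];
  for (const { title, text } of pages) {
    const oldHtml = normalizeHtml(oldParse(text));
    const newHtml = normalizeHtml(newParse(text));
    if (oldHtml !== newHtml) {
      mismatches.push(title);
    }
  }
  return {
    total: pages.length,
    matched: pages.length - mismatches.length,
    mismatches,
  };
}
```

The matched/total ratio is the kind of number that could be tracked over time, and the mismatches list points at the pages worth inspecting by hand.<br>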

<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

I just think this is a question -- and answer -- that people should be<br>

slowly internalizing as we proceed down this path.<br>

<br>

1) Formal Spec<br>

2) Multiple Implementations<br>

3) Test Suite<br>

<br>

I don&#39;t think it&#39;s completely unreasonable that we might have a way to<br>

grind articles against the current parser, and each new parser, and diff<br>

the output.  Something like that&#39;s the only way I can see that we *will*<br>

be able to tell how close new parsers come, and on which constructs they<br>

break (not that this means that I think The Wikipedia Corpus constitutes<br>

a valid Test Suite :-).<br></blockquote><div><br>Definitely. :)<br><br>Output comparisons can be a tricky business, but they&#39;ll be an important component. I&#39;ve started on a CLI batch testing framework for the JavaScript parser class (using node.js; in ParserPlayground/tests) that can read through a Wikipedia XML dump and run round-tripping checks. Moving on to generating HTML output and comparing it against output from the current MediaWiki parser will be very valuable (though doing comparisons on HTML is tricky!).<br>
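<br>For illustration only, a round-tripping check of the kind mentioned above might be sketched like this. This is not the code in ParserPlayground/tests: roundTripCheck, firstDifference, and the parse/serialize callbacks are hypothetical names, and the sketch assumes the parser can turn wikitext into some tree and serialize that tree back to wikitext.<br>

```javascript
// Hypothetical sketch: parse and serialize stand in for whatever the
// parser under test actually exposes.

// Index of the first character where a and b differ, or -1 if identical.
function firstDifference(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return i;
  }
  // One string is a prefix of the other, or they are equal.
  return a.length === b.length ? -1 : n;
}

// Parse wikitext, serialize the result back, and report whether the
// output matches the input exactly.
function roundTripCheck(wikitext, parse, serialize) {
  const output = serialize(parse(wikitext));
  const offset = firstDifference(wikitext, output);
  return { ok: offset === -1, offset };
}
```

Reporting the offset of the first difference, rather than only pass/fail, makes it easier to see which construct in a page failed to survive the round trip.<br>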

<br>-- brion<br></div></div>