On Mon, Jul 11, 2011 at 4:46 PM, Erik Rose <span dir="ltr">&lt;<a href="mailto:erik@mozilla.com">erik@mozilla.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Say, while everybody's trying to figure out a formal grammar, have you had a look at Ward Cunningham's exploratory parsing kit? He gave me a demo at OSBridge, and it's a really handy tool. Basically, it's a web app with an asynchronous C backend. You paste a tentative PEG grammar into a textarea, and it runs through whatever corpus you want, showing you representative instances of how it does or does not match. He was running it against the full English Wikipedia on his laptop, and it took only half an hour or something—with results coming in as they were generated, of course. </blockquote>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<br>
Using that, they made a PEG-and-then-some implementation of MW syntax that parses darn near all of Wikipedia: <a href="https://github.com/AboutUs/kiwi/blob/master/src/syntax.leg" target="_blank">https://github.com/AboutUs/kiwi/blob/master/src/syntax.leg</a>. (I call it "PEG-and-then-some" because it does have a lot of callbacks which might interlock with and affect the rule matching—not sure.)<br>
</blockquote><div><br>It is indeed dang impressive -- I expect to be stealing at least some of those grammar rules. :)<br><br>We are, however, producing a different sort of intermediate structure rather than going straight to HTML output, so things won't be an exact match (especially where we do template stuff).<br>
<br>-- brion<br></div></div>