On Thu, Feb 21, 2008 at 03:43:54PM +1100, Steve Bennett wrote:
On 2/21/08, Jay R. Ashworth jra@baylink.com wrote:
I don't know if you remember it at this point, Steve, but one of the reasons I threw "won't someone *please* build us a grammar-driven parser" up in the air (and thanks, BTW :-), was precisely to get a fairly reliable count of how often each possible bit'o'grammar appears in, say, en.wp, so as to get a feeling for what will break if the syntax is restricted slightly...
That is to say that I concur with your instinct: 90/10 rule, I would guess, here.
Ah, yes.
Well, it really should be pretty easy to produce some stats like "in this corpus, there are 850 inline links, 560 images, 227 bullet list items" etc. It will be harder to detect subtle things like "things which closely resemble, but aren't, inline images" or "external links wrapped in double square brackets by some moron".
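For illustration only, here is a minimal sketch of the kind of corpus count described above. It uses rough regular expressions rather than the grammar-driven parser under discussion, so it is an approximation at best. The pattern names, the `count_constructs` helper, and the reading of "inline links" as `[[...]]` links are all assumptions made for this example, not anything taken from MediaWiki itself.

```python
import re
from collections import Counter

# Hypothetical, deliberately rough patterns. Real wikitext has many edge
# cases these miss (nested links in image captions, templates, nowiki, ...).
PATTERNS = {
    # [[Image:...]] or [[File:...]] with no nested brackets inside
    "image": re.compile(r"\[\[(?:Image|File):[^\[\]]*\]\]", re.IGNORECASE),
    # [[...]] that is neither an image nor a URL in double brackets
    "inline_link": re.compile(
        r"\[\[(?!(?:Image|File):|(?:https?|ftp)://)[^\[\]]*\]\]",
        re.IGNORECASE,
    ),
    # one count per line that starts with one or more asterisks
    "bullet_item": re.compile(r"^\*+", re.MULTILINE),
    # [http://... label] in single square brackets
    "external_link": re.compile(
        r"(?<!\[)\[(?:https?|ftp)://[^\s\]]+[^\]]*\](?!\])"
    ),
    # external URL mistakenly wrapped in double square brackets
    "double_bracket_external": re.compile(
        r"\[\[(?:https?|ftp)://[^\[\]]*\]\]"
    ),
}


def count_constructs(text):
    """Return a Counter mapping each pattern name to its match count."""
    return Counter(
        {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}
    )


if __name__ == "__main__":
    sample = (
        "* See [[Foo]] and [[Bar|bar]].\n"
        "* [[Image:Cat.jpg|thumb]]\n"
        "Visit [http://example.com Example] or [[http://example.org wrong]].\n"
    )
    for name, n in sorted(count_constructs(sample).items()):
        print(f"{name}: {n}")
```

The easy constructs fall out of simple patterns like these. The subtle cases (things that closely resemble, but are not, inline images) are exactly where a regex count becomes unreliable and a real grammar would be needed.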
Oh sure. Building the test harness will be an iterative process. But once someone does, we'll actually have not only a formal grammar, but a second reference implementation...
Cheers, -- jra