On Thu, Feb 21, 2008 at 03:43:54PM +1100, Steve Bennett wrote:
> On 2/21/08, Jay R. Ashworth <jra(a)baylink.com> wrote:
> > I don't know if you remember it at this point, Steve, but one of the
> > reasons I threw "won't someone *please* build us a grammar-driven
> > parser" up in the air (and thanks, BTW :-), was precisely to get a
> > fairly reliable count of how often each possible bit o' grammar appears
> > in, say, en.wp, so as to get a feeling for what will break if the
> > syntax is restricted slightly...
> >
> > That is to say that I concur with your instinct: 90/10 rule, I would
> > guess, here.
>
> Ah, yes.
>
> Well, it really should be pretty easy to produce some stats like "in
> this corpus, there are 850 inline links, 560 images, 227 bullet list
> items" etc. It will be harder to detect subtle things like "things
> which closely resemble, but aren't, inline images" or "external links
> wrapped in double square brackets by some moron".

Oh sure. Building the test harness will be an iterative process. But
once someone does, we'll actually have not only a formal grammar, but a
second reference implementation...
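
For the easy half of that (raw counts per construct), a first pass might
look something like the Python sketch below. To be clear about what is
assumed here: the regexes are naive stand-ins for a real grammar-driven
parser, the category names and the count_constructs() helper are made up
for illustration, and the sample text is invented. It will miscount
anything nested (a link inside an image caption, say), which is exactly
the sort of thing the iterative harness work would have to flush out.

```python
import re
from collections import Counter

# Hypothetical, deliberately naive patterns. A grammar-driven parser would
# replace these; they only approximate the constructs named in the thread.
PATTERNS = {
    # [[Image:...]] or [[File:...]] with no nested brackets in the caption
    "image": re.compile(r"\[\[(?:Image|File):[^\]]*\]\]", re.IGNORECASE),
    # [[Target]] or [[Target|label]], excluding images and bracketed URLs
    "internal_link": re.compile(
        r"\[\[(?!\s*(?:image:|file:|https?://))[^\[\]]+\]\]", re.IGNORECASE
    ),
    # Any line starting with one or more asterisks counts as one list item
    "bullet_item": re.compile(r"^\*+", re.MULTILINE),
    # External URL wrongly wrapped in double square brackets
    "bracketed_external": re.compile(r"\[\[\s*https?://[^\]]*\]\]", re.IGNORECASE),
}


def count_constructs(wikitext):
    """Return a Counter mapping each pattern name to its number of matches."""
    return Counter(
        {name: len(pattern.findall(wikitext)) for name, pattern in PATTERNS.items()}
    )


# Invented sample text, not taken from any real article.
sample = """\
* See [[Main Page]] and [[Help:Contents|help]].
* [[Image:Example.png|thumb|An example]]
** Nested item with [[http://example.com bad link]]
Plain line with [http://example.org ok link].
"""

if __name__ == "__main__":
    for name, n in sorted(count_constructs(sample).items()):
        print(name, n)
```

Run over a dump of en.wp instead of the toy sample, and compared against
what the parser reports for the same pages, the differences between the
two sets of counts would point at the near-miss cases Steve mentions.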
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra(a)baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates
http://baylink.pitas.com '87 e24
St Petersburg FL USA
http://photo.imageinc.us +1 727 647 1274
Those who cast the vote decide nothing.
Those who count the vote decide everything.
-- (Joseph Stalin)