On Thu, Feb 21, 2008 at 01:12:34PM +1100, Steve Bennett wrote:
> On 2/21/08, Jay R. Ashworth <jra(a)baylink.com> wrote:
>> On Thu, Feb 21, 2008 at 01:16:22AM +1100, Steve Bennett wrote:
>>> Time to take this grammar and do something with it.
>> Build a parser with it, run it against the corpus, and see how often
>> each individual rule pukes?
> Ok. I've actually done a bit of that, but I guess I should ramp up the
> scale. It can be hard to detect pukage without actually generating
> XHTML and comparing it, though.
> Generally, though, the answer is "not often". Flip through some random
> wikitext. You'll find that a very small number of rules account for the
> vast majority of actual use. Though that may change once I have to
> contend with the body of templates. People don't use tables much. They
> don't use HTML tags or entities much. They almost never use magic
> links (especially PMID - wtf is that about?). They almost never use
> horizontal rules or HTML comments, and rarely even extensions like <ref>.
I don't know if you remember it at this point, Steve, but one of the
reasons I threw "won't someone *please* build us a grammar-driven
parser" up in the air (and thanks, BTW :-), was precisely to get a
fairly reliable count of how often each possible bit'o'grammar appears
in, say, en.wp, so as to get a feeling for what will break if the
syntax is restricted slightly...
That is to say that I concur with your instinct: 90/10 rule, I would
guess, here.
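For illustration only, the kind of per-construct count described above could be roughed out as below. This is a sketch under loud assumptions, not the grammar-driven parser under discussion: the regexes are hypothetical, simplified stand-ins for a handful of wikitext rules, and the two sample pages are made up rather than taken from en.wp.

```python
import re
from collections import Counter

# Hypothetical, simplified patterns (assumptions for illustration only).
# Real wikitext rules are far messier than these regexes.
PATTERNS = {
    "internal_link": re.compile(r"\[\[[^\[\]]+\]\]"),
    "template": re.compile(r"\{\{[^{}]+\}\}"),
    "table_start": re.compile(r"^\{\|", re.MULTILINE),
    "horizontal_rule": re.compile(r"^----+\s*$", re.MULTILINE),
    "html_comment": re.compile(r"<!--.*?-->", re.DOTALL),
    "ref_tag": re.compile(r"<ref[\s>/]", re.IGNORECASE),
    "magic_link_pmid": re.compile(r"\bPMID\s+\d+"),
}


def count_constructs(pages):
    """Tally how often each pattern matches across an iterable of page texts."""
    totals = Counter()
    for text in pages:
        for name, pattern in PATTERNS.items():
            totals[name] += len(pattern.findall(text))
    return totals


# Made-up sample pages standing in for a real corpus dump.
sample_pages = [
    "'''Foo''' is a [[bar]] and a [[baz|qux]].<ref>Some source</ref>\n----\n",
    "{{stub}}\nSee [[Foo]]. <!-- todo -->\n",
]

counts = count_constructs(sample_pages)
for name, n in counts.most_common():
    print(f"{name:16} {n}")
```

Run over a real dump, a tally like this would only approximate the numbers a true parser gives, since regexes miss nesting and context, but it is enough to see whether usage is as lopsided as guessed.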
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra(a)baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates
http://baylink.pitas.com '87 e24
St Petersburg FL USA
http://photo.imageinc.us +1 727 647 1274
Those who cast the vote decide nothing.
Those who count the vote decide everything.
-- (Joseph Stalin)