On Thu, Feb 21, 2008 at 01:12:34PM +1100, Steve Bennett wrote:
On 2/21/08, Jay R. Ashworth jra@baylink.com wrote:
On Thu, Feb 21, 2008 at 01:16:22AM +1100, Steve Bennett wrote:
Time to take this grammar and do something with it.
Build a parser with it, run it against the corpus, and see how often each individual rule pukes?
Ok. I've actually done a bit of that, but I guess I should ramp up the scale. It can be hard to detect pukage without actually generating XHTML and comparing it, though.
Generally, though, the answer is "not often". Flip through some random wikitext. You'll find that a very small number of rules account for the vast majority of actual use. Though that may change once I have to contend with the body of templates. People don't use tables much. They don't use HTML tags or entities much. They almost never use magic links (especially PMID - wtf is that about). They almost never use horizontal rules or HTML comments, and rarely even extensions like <ref>.
I don't know if you remember it at this point, Steve, but one of the reasons I threw "won't someone *please* build us a grammar-driven parser" up in the air (and thanks, BTW :-), was precisely to get a fairly reliable count of how often each possible bit'o'grammar appears in, say, en.wp, so as to get a feeling for what will break if the syntax is restricted slightly...
That is to say that I concur with your instinct: 90/10 rule, I would guess, here.
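For a feel of what such a count could look like, here is a rough Python sketch. It is not the grammar-driven parser discussed above: it only approximates a handful of constructs with regular expressions, and the pattern names, the patterns themselves, and the count_constructs helper are all assumptions made up for illustration. A real run would feed it page text from a dump and would swap the regexes for per-rule hits from the actual parser.

```python
import re
from collections import Counter

# Hypothetical, deliberately rough patterns for a few wikitext constructs.
# They will miscount edge cases (nesting, nowiki, etc.); a grammar-driven
# parser would report per-rule matches instead.
PATTERNS = {
    "table": re.compile(r"^\{\|", re.MULTILINE),
    "html_comment": re.compile(r"<!--.*?-->", re.DOTALL),
    "ref_extension": re.compile(r"<ref[\s>/]", re.IGNORECASE),
    "html_entity": re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|[a-zA-Z]+);"),
    "horizontal_rule": re.compile(r"^-{4,}", re.MULTILINE),
    "magic_link": re.compile(r"\b(?:PMID|RFC) \d+|\bISBN [\d-]+[\dXx]\b"),
    "internal_link": re.compile(r"\[\[[^\[\]]+\]\]"),
    "template": re.compile(r"\{\{[^{}]+\}\}"),
}


def count_constructs(pages):
    """Return (total occurrences, number of pages using) per construct.

    `pages` is any iterable of wikitext strings.
    """
    totals = Counter()
    pages_using = Counter()
    for text in pages:
        for name, pattern in PATTERNS.items():
            hits = len(pattern.findall(text))
            if hits:
                totals[name] += hits
                pages_using[name] += 1
    return totals, pages_using


if __name__ == "__main__":
    # Two tiny made-up pages standing in for a real corpus.
    sample = [
        "'''Foo''' is a [[bar]] and a [[baz|qux]].\n----\nSee PMID 12345.",
        '{| class="wikitable"\n| a\n|}\n<!-- note -->{{cite|x}}<ref>src</ref> &amp;',
    ]
    totals, pages_using = count_constructs(sample)
    for name, total in totals.most_common():
        print(f"{name}: {total} uses on {pages_using[name]} page(s)")
```

Tracking pages-using alongside raw totals matters for the "what will break" question: a construct with many uses concentrated on a few pages is a different cleanup job from one sprinkled thinly across the whole corpus.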
Cheers, -- jra