On 2/21/08, Jay R. Ashworth <jra(a)baylink.com> wrote:
> On Thu, Feb 21, 2008 at 01:16:22AM +1100, Steve Bennett wrote:
> > Time to take this grammar and do something with it.
>
> Build a parser with it, run it against the corpus, and see how often
> each individual rule pukes?

Ok. I've actually done a bit of that, but I guess I should ramp up the
scale. It can be hard to detect pukage without actually generating
XHTML and comparing it, though.
Generally, though, the answer is "not often". Flip through some random
wikitext. You'll find that a very small number of rules account for the
vast majority of actual use, though that may change once I have to
contend with the body of templates. People don't use tables much. They
don't use HTML tags or entities much. They almost never use magic
links (especially PMID - wtf is that about?). They almost never use
horizontal rules or HTML comments, and rarely even extensions like <ref>.
Steve
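
A rough way to put numbers on the "flip through some random wikitext"
observation is to tally how often each construct shows up in a sample of
pages. The sketch below is only an illustration: the pattern table and the
tally_constructs function are hypothetical names, the regular expressions are
crude approximations rather than the grammar under discussion, and the input
is assumed to be a list of raw wikitext strings.

```python
import re
from collections import Counter

# Hypothetical, deliberately rough patterns. They approximate a few wikitext
# constructs for counting purposes and are NOT the grammar from this thread.
PATTERNS = {
    "table": re.compile(r"^\{\|", re.MULTILINE),
    "html_entity": re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#\d+|#x[0-9A-Fa-f]+);"),
    "magic_link": re.compile(r"\b(?:ISBN|PMID|RFC)\s+\d"),
    "horizontal_rule": re.compile(r"^-{4,}", re.MULTILINE),
    "html_comment": re.compile(r"<!--.*?-->", re.DOTALL),
    "ref": re.compile(r"<ref\b[^>]*>", re.IGNORECASE),
    "internal_link": re.compile(r"\[\[[^\[\]]+\]\]"),
    "bold_italic": re.compile(r"'{2,5}"),
}


def tally_constructs(pages):
    """Count pattern matches across an iterable of wikitext strings."""
    # Start every construct at zero so unused ones still appear in the result.
    counts = Counter({name: 0 for name in PATTERNS})
    for text in pages:
        for name, pattern in PATTERNS.items():
            counts[name] += len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    sample = ["'''Foo''' is a [[bar]].<ref>Some source</ref>\n"]
    for name, n in tally_constructs(sample).most_common():
        print(f"{name:16} {n}")
```

Counting regex hits only measures how often a construct appears, not whether
a parser handles it correctly; detecting actual failures would still need the
generated XHTML to be compared, as noted above.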