On 11/10/07, Jay R. Ashworth <jra@baylink.com> wrote:
> Specifically, I was proposing defining the combinations of the current parser tokens which are difficult to interpret (primarily, combinations of bold, italics, and apostrophes), and determining how frequently they appear in the live corpus.
>
> This will delimit the *actual* size of the Installed Base problem, in both meanings I gave it earlier. If, in 2 megapages, there are only 100 occurrences, you fix them by hand. If 1,000, you grind a robot. If 500K, then you take a different approach to the overall problem.
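Measuring that frequency could be a single pass over a dump. A minimal sketch in Python, assuming pages arrive as (title, wikitext) pairs (e.g. from a dump reader such as mwxml); the run-length rule below is illustrative, not MediaWiki's actual tokenizer:

import re
from collections import Counter

# '' = italic, ''' = bold, ''''' = bold+italic; runs of 4, or of 6 and
# up, have no single obvious reading, so count them as "difficult".
APOSTROPHE_RUN = re.compile(r"'{2,}")

def count_difficult_runs(pages):
    tally = Counter()  # run length -> number of occurrences
    for title, text in pages:
        for match in APOSTROPHE_RUN.finditer(text):
            n = len(match.group())
            if n == 4 or n > 5:
                tally[n] += 1
    return tally

# ''''x'''' is ambiguous (bold plus a stray quote? italics plus two?),
# while '', ''', and ''''' all parse cleanly.
print(count_difficult_runs([("Test", "''''x'''' and '''bold'''")]))

Summing that tally against the corpus size gives the 100 / 1,000 / 500K decision point directly.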
Ok, it's still backwards from how I would picture it:

1) Come up with a solution (i.e., a new parser).
2) See how many pages that solution fits; call it X%.
3) If X% is too small, either extend the parser by adding more rules, or update pages.
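Step 2 is mechanical once a candidate parser exists. A hedged sketch of that measurement, where new_parser and reference_render are hypothetical stand-ins and a page "fits" if the new grammar accepts it and agrees with the current renderer:

def coverage(pages, new_parser, reference_render):
    handled = total = 0
    for title, text in pages:
        total += 1
        try:
            if new_parser(text) == reference_render(text):
                handled += 1  # parses and round-trips identically
        except SyntaxError:  # hypothetical: page falls outside the grammar
            pass
    return 100.0 * handled / total if total else 0.0

The pages that fail the check become the worklist for step 3, whether the fix lands in the grammar or in the pages themselves.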
But this is probably just philosophy at this point: I'd rather be focusing on the grammar that we want to implement than the grammar that we don't want to implement.
Steve