On Sat, Nov 10, 2007 at 05:30:53PM +1100, Steve Bennett wrote:
> On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
> > They certainly are, if no one ever examines the corpus. I've just banged up a new server in the office; if no one else who already *has* a mirror of, say, en.wp set up steps up, I may do the testing myself, in my Copious Free Time.
> What are you proposing, autobotically replacing ''' with **?
Specifically, I was proposing defining the combinations of the current parser tokens which are difficult to interpret (primarily, combinations of bold, italics, and apostrophes), and determining how frequently they appear in the live corpus.
This will delimit the *actual* size of the Installed Base problem, in both meanings I gave it earlier. If, in 2 megapages, there are only 100 occurrences, you fix them by hand. If 1000, you grind a robot. If 500K, then you take a different approach to the overall problem.
(Among USAdians, this is known as "Dropping back 10, and punting".)
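
One way to sketch that measurement, under loudly stated assumptions: "difficult" is taken here to mean (a) apostrophe runs of length 4 or of 6 and more, which mix markup with literal apostrophes, and (b) lines where the italic or bold toggles do not pair up. That definition is illustrative only, not an agreed list, and the names `classify_line` and `count_ambiguous` are hypothetical. The real parser has more cases than this covers.

```python
import re
from collections import Counter

# Any run of two or more apostrophes is a candidate quote token.
APOSTROPHE_RUN = re.compile(r"'{2,}")


def classify_line(line):
    """Return labels for the hard-to-interpret patterns found in one line.

    Assumed (illustrative) definition of "hard to interpret":
      odd_run_length: a run of 4, or of 6 or more, apostrophes
      unbalanced:     an odd number of italic or bold toggles on the line,
                      counting only runs of exactly 2, 3, or 5
    """
    runs = [len(m.group()) for m in APOSTROPHE_RUN.finditer(line)]
    labels = []
    if any(n == 4 or n > 5 for n in runs):
        labels.append("odd_run_length")
    # A run of 5 toggles both italic and bold.
    italic_toggles = sum(1 for n in runs if n in (2, 5))
    bold_toggles = sum(1 for n in runs if n in (3, 5))
    if italic_toggles % 2 or bold_toggles % 2:
        labels.append("unbalanced")
    return labels


def count_ambiguous(pages):
    """Tally labels over an iterable of page texts (one string per page)."""
    totals = Counter()
    for text in pages:
        for line in text.splitlines():
            totals.update(classify_line(line))
    return totals


if __name__ == "__main__":
    sample_pages = [
        "''italic'' and '''bold'''",
        "''''four apostrophes''''",
        "'''bold that never closes",
    ]
    for label, count in sorted(count_ambiguous(sample_pages).items()):
        print(label, count)
```

Fed the page texts from a dump, the totals per label are the numbers to hold up against the 100 / 1000 / 500K cases above.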
Cheers,
-- jra