Proofreading based on statistics - Wikisource-l

23 May 2013

It should be possible, in any language of Wikisource, to
check all existing text against a known dictionary valid
for the year each text was published, and to find words
that are outside the dictionary. These words could be
proofread in some tool
similar to a CAPTCHA. They might be uncommon place names
that are correctly OCRed but not in the dictionary, or
they could be OCR errors, or both.

Has anybody tried this?

Such finds are not necessarily the only OCR errors.
Some OCR errors result in correctly spelled words that
are found in the dictionary, e.g. burn -> bum.
So full manual proofreading and validation will still be
needed. But a statistics-based approach could fill gaps
and quickly improve full-text searchability.
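
To make the idea concrete, here is a minimal sketch in Python
of the dictionary check. It is not an existing Wikisource
tool. The function name, the tiny word list and the sample
sentence are all made up for illustration; a real run would
need a period-appropriate dictionary for each language.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by internal
# apostrophes or hyphens. Digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def out_of_dictionary(text, dictionary):
    """Count the words in `text` whose lowercase form is not in `dictionary`.

    Hypothetical helper: the result is a list of candidates for
    proofreading, not a list of confirmed OCR errors.
    """
    known = {w.lower() for w in dictionary}
    counts = Counter()
    for match in WORD_RE.finditer(text):
        word = match.group(0)
        if word.lower() not in known:
            counts[word] += 1
    return counts


# Made-up sample data, standing in for a real dictionary and a real page.
sample_dictionary = {"the", "old", "house", "did", "not", "burn", "bum", "in"}
sample_text = "The old housc in Uppsala did not bum."

suspects = out_of_dictionary(sample_text, sample_dictionary)
for word, count in suspects.most_common():
    print(word, count)
```

With this sample, "housc" (an OCR error) and "Uppsala" (a
correct place name missing from the word list) are both
flagged, while "bum" passes unnoticed because it is a
dictionary word. Sorting the candidates by frequency is one
possible way to decide which to show to proofreaders first.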

-- 
   Lars Aronsson (lars(a)aronsson.se)
   Aronsson Datateknik - http://aronsson.se

   Project Runeberg - free Nordic literature - http://runeberg.org/