On 4/30/2010 17:44, Lars Aronsson wrote:
Digitization projects sometimes report their OCR quality as a percentage in the range 80-100%, depending on print quality, image quality, and which OCR software they used. I guess that is the percentage of characters correctly interpreted. When you outsource digitization, the OCR quality can be a parameter of the delivery.
Anyway, I see so many OCR errors that I doubt these estimates are accurate. Are there any known cases where statements about OCR quality have been questioned?
One problem with estimating OCR quality is that you compare what you have (the actual OCR output) against something you don't have (the perfectly proofread page). You can take samples, but in Wikisource we have more than just samples. We have complete works that have been fully proofread, and a version history that shows what we started out with. Yes, I think it is important to save an initial version of the raw OCR text before you start any proofreading.
Do we have any software that can compare two versions of a page and tell what percentage of characters were the same in both versions, i.e. the OCR quality?
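Such a comparison can be sketched with Python's standard difflib module; this is a minimal illustration, assuming the raw OCR version and the proofread version are available as plain text strings (the example sentences are invented):

```python
import difflib

def ocr_quality(ocr_text: str, proofread_text: str) -> float:
    """Estimate OCR quality as the fraction of characters shared
    between the raw OCR output and the proofread version.

    SequenceMatcher.ratio() returns 2*M / T, where M is the number
    of matched characters and T the total length of both texts.
    """
    matcher = difflib.SequenceMatcher(None, ocr_text, proofread_text)
    return matcher.ratio()

# Hypothetical example: typical OCR confusions (rn/m, vv/w, c/e).
raw = "Tbe quick brovvn fox jumps ovcr the lazy dog."
proofread = "The quick brown fox jumps over the lazy dog."
print(f"OCR quality: {ocr_quality(raw, proofread):.1%}")
```

In practice one would fetch the first and latest revision of each page from the wiki's version history and average the ratio over a whole work. Note that ratio() rewards long matching runs, so it approximates but does not exactly equal a per-character accuracy figure.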
I once read that the scanner industry, in its promotion of OCR, plays on the difficulty most people have interpreting probabilities: most people don't easily realize that 99% reliable OCR implies roughly one error per line on a densely printed page in small type. Anything below 99% becomes very tedious to use, and anything below 95% seems utterly useless.
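The arithmetic behind that claim is easy to check; here is a small sketch, where the 70 characters per line and 40 lines per page are illustrative assumptions for a densely printed page, not figures from the original post:

```python
# Assumed page geometry for a densely printed page in small type.
CHARS_PER_LINE = 70
LINES_PER_PAGE = 40

def errors_per_line(accuracy: float) -> float:
    """Expected character errors per line at a given OCR accuracy."""
    return (1 - accuracy) * CHARS_PER_LINE

def errors_per_page(accuracy: float) -> float:
    """Expected character errors per page at a given OCR accuracy."""
    return errors_per_line(accuracy) * LINES_PER_PAGE

for acc in (0.99, 0.95, 0.90):
    print(f"{acc:.0%} accuracy: {errors_per_line(acc):.1f} errors/line, "
          f"{errors_per_page(acc):.0f} errors/page")
```

At 99% accuracy this gives about 0.7 errors per line, close to the "one error per line" figure; at 95% it is already several errors on every line, which explains why such output feels useless to proofread.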
Erik Zachte