On 4/30/2010 17:44, Lars Aronsson wrote:
Digitization projects sometimes mention their OCR
quality,
depending on the print quality, image quality, and which
OCR software they used, as a percentage in the range
80-100%. I guess that is the percentage of characters
correctly interpreted. When you outsource digitization,
the OCR quality can be a parameter for the delivery.
Anyway, I see so many OCR errors that I doubt these
estimates are accurate. Are there any known cases where
statements about OCR quality have been questioned?
One problem with estimating the OCR quality is that you
compare what you have (the actual OCR output) against
something you don't have (the perfectly proofread page).
You can make samples, but in Wikisource we have more than
just samples. We have complete works that have been
fully proofread. And a version history that shows what
we started out with. Yes, I think it is important to
save an initial version of the raw OCR text before you
start to do any proofreading.
Do we have any software that can compare two versions
of a page and tell what percentage of characters were
the same in both versions, i.e. the OCR quality?
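A minimal sketch of such a comparison, using Python's standard difflib (the sample strings are made up for illustration): SequenceMatcher.ratio() returns 2*M/T, where M is the number of matching characters and T the combined length of both texts, which can serve as a rough per-character accuracy estimate of the raw OCR against the proofread version.

```python
import difflib

def ocr_quality(raw_ocr: str, proofread: str) -> float:
    """Fraction of characters shared between the raw OCR text
    and the fully proofread version of the same page."""
    matcher = difflib.SequenceMatcher(None, raw_ocr, proofread,
                                      autojunk=False)
    return matcher.ratio()

# Hypothetical example: raw OCR output vs. the proofread text.
raw = "Tlie quick brovvn fox jurnps over the lazy dog."
fixed = "The quick brown fox jumps over the lazy dog."
print(f"Estimated OCR quality: {ocr_quality(raw, fixed):.1%}")
```

In practice one would feed it the initial revision from the version history and the current proofread revision of each page. Note that ratio() measures similarity of the two strings, which is close to, but not identical with, character accuracy as a vendor would define it (insertions and deletions shift the alignment).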
I once read that the scanner industry, in its promotion of OCR,
plays on the difficulty most people have interpreting probabilities:
most people don't easily realize that 99% reliable OCR implies
roughly one error per line on a densely printed page in small type.
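A quick back-of-envelope check of that claim (the characters-per-line and lines-per-page figures are assumptions for a dense page in small type):

```python
# At 99% per-character accuracy, 1% of characters are wrong.
error_rate = 0.01        # 99% reliable OCR
chars_per_line = 80      # assumed: dense print, small type
lines_per_page = 50      # assumed

errors_per_line = error_rate * chars_per_line
errors_per_page = errors_per_line * lines_per_page
print(f"Expected errors per line: {errors_per_line:.1f}")
print(f"Expected errors per page: {errors_per_page:.0f}")
```

So at 99% accuracy you expect on the order of one error per line, and dozens per page.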
Anything below 99% becomes very tedious to use; anything below 95%
seems utterly useless.
Erik Zachte