Digitization projects sometimes state their OCR quality
as a percentage in the range 80-100%, depending on the
print quality, the image quality, and which OCR software
was used. I assume this means the percentage of
characters correctly recognized. When digitization is
outsourced, OCR quality can be a delivery requirement.
Still, I see so many OCR errors that I doubt these
estimates are accurate. Are there any known cases where
claims about OCR quality have been challenged?
One problem with estimating OCR quality is that you are
comparing what you have (the actual OCR output) against
something you don't have (a perfectly proofread page).
You can take samples, but on Wikisource we have more
than samples. We have complete works that have been
fully proofread, and a version history that shows what
we started out with. Yes, I think it is important to
save an initial version of the raw OCR text before any
proofreading begins.
Do we have any software that can compare two versions
of a page and report what percentage of the characters
are the same in both, i.e. the OCR quality?
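One rough way to do this (a sketch in Python, not any
existing Wikisource tool; the function names are my own)
is to compute the Levenshtein edit distance between the
raw OCR version and the proofread version, and report
one minus the distance divided by the length of the
proofread text:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions,
    deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[len(b)]

def ocr_accuracy(raw_ocr: str, proofread: str) -> float:
    """Character accuracy relative to the proofread text,
    i.e. 1 minus the character error rate."""
    if not proofread:
        return 0.0
    return 1.0 - levenshtein(raw_ocr, proofread) / len(proofread)

print(round(ocr_accuracy("Tlie quick brown f0x",
                         "The quick brown fox"), 3))  # → 0.842
```

The division by the proofread length is one convention
(the "character error rate" used in OCR evaluation);
dividing by the longer of the two strings would give a
slightly different figure.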
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se