[Wikisource-l] [cultural-partners] ABBYY Finereader 11 on Toolserver: do we like it?

Mon Nov 28 13:25:21 UTC 2011

On 11/28/2011 01:59 PM, Mathias Schindler wrote:
> I recommend sticking with and supporting open source technology that
> has been made available by third parties, such as
> http://code.google.com/p/ocropus/ /
> http://code.google.com/p/tesseract-ocr/

Do you recommend this based on experience, or based on free software
ideology? Apparently the Internet Archive tried free software and gave
up, because Finereader was far better. Are there any good examples
where free software has delivered good OCR quality?

Wikisource does provide feedback on quality: once an OCRed page has
been proofread, the OCR software could learn from the diff between
the raw OCR text and the proofread text.
But is there any OCR software that can take this kind of input?
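
To make "learn from the diff" concrete, here is a minimal sketch of
what such feedback could look like before it reaches any OCR engine.
It only uses Python's standard difflib to pull out word-level
substitutions between the raw OCR text and the proofread text. The
function name correction_pairs and the sample strings are made up for
illustration, and whether any OCR engine can consume pairs like these
is exactly the open question.

```python
import difflib
from collections import Counter


def correction_pairs(ocr_text, proofread_text):
    """Return (ocr_fragment, corrected_fragment) pairs from a word-level diff.

    Hypothetical helper: it only extracts substitutions, it does not
    train or configure any OCR engine.
    """
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a run of OCR words that proofreaders changed.
        if tag == "replace":
            pairs.append((" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2])))
    return pairs


# Invented sample data, not taken from a real Wikisource page.
ocr = "Tlie quick brown fox jumps ovcr the lazy dog"
proofread = "The quick brown fox jumps over the lazy dog"

# Counting pairs over many pages would show which misreadings recur.
counts = Counter(correction_pairs(ocr, proofread))
for (wrong, right), n in counts.items():
    print(f"{wrong!r} -> {right!r} ({n}x)")
```

Aggregated over many proofread pages, counts like these could at
least feed a post-correction dictionary, even if the engine itself
cannot be retrained.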

When running OCR as an engine/server/API, what do we do when it
misinterprets columns in a page, and reads long lines across the
page? Is there a way to manually indicate where columns are, and
resubmit the page for new OCR?
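
One possible workaround, sketched below, assumes the OCR service
returns word positions along with the text. In that case a manually
indicated column boundary could be applied afterwards, without a new
OCR run. The (x, y, text) tuples, the function name
reorder_by_columns and the boundary value are all assumptions for
illustration and do not describe the API of Finereader or any other
engine.

```python
import bisect


def reorder_by_columns(words, column_edges):
    """Regroup OCR words into columns using manually supplied boundaries.

    words: list of (x, y, text) tuples, coordinates in pixels (assumed format).
    column_edges: sorted x positions where one column ends and the next begins.
    Returns one text string per column, left to right.
    """
    columns = [[] for _ in range(len(column_edges) + 1)]
    for x, y, text in words:
        # bisect picks the column whose x range contains this word.
        columns[bisect.bisect_right(column_edges, x)].append((y, x, text))
    result = []
    for column in columns:
        column.sort()  # top to bottom, then left to right within a line
        result.append(" ".join(text for _, _, text in column))
    return result


# Invented two-column page with two lines per column.
words = [
    (10, 0, "Left"), (60, 0, "one"), (210, 0, "Right"), (260, 0, "one"),
    (10, 20, "left"), (60, 20, "two"), (210, 20, "right"), (260, 20, "two"),
]
for column_text in reorder_by_columns(words, column_edges=[200]):
    print(column_text)
```

This only repairs the reading order. It does not help when the
engine has merged characters across the column gap, in which case the
page would still need to be resubmitted with the column regions
marked.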

-- 
   Lars Aronsson (lars at aronsson.se)
   Aronsson Datateknik - http://aronsson.se