On 11/28/2011 01:59 PM, Mathias Schindler wrote:
> I recommend sticking and supporting open source technology that has
> been made available by third parties, such as
> http://code.google.com/p/ocropus/ /
> http://code.google.com/p/tesseract-ocr/
Do you recommend this based on experience, or based on free software
ideology? Apparently the Internet Archive tried and gave up, because
FineReader was far better. Are there any good examples where free
software has delivered good OCR quality?
Wikisource does provide feedback on quality: once a page has been
proofread, the OCR software could learn from the diff between its raw
output and the corrected text. But is there any OCR software that can
take this kind of input?
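
To make "learn from the diff" concrete, here is a minimal sketch of how
the feedback could be extracted, using only Python's standard difflib.
The function name correction_pairs and the sample strings are made up
for illustration. Whether any OCR engine can actually consume such
pairs is exactly the open question; this only shows what the input
might look like.

```python
from collections import Counter
from difflib import SequenceMatcher


def correction_pairs(ocr_text, proofread_text):
    """Count (ocr_fragment, corrected_fragment) pairs found by a word-level diff.

    Hypothetical helper: it only extracts the substitutions a proofreader made.
    Insertions and deletions are ignored to keep the sketch short.
    """
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    matcher = SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    pairs = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            wrong = " ".join(ocr_words[i1:i2])
            right = " ".join(good_words[j1:j2])
            pairs[(wrong, right)] += 1
    return pairs


# Invented sample data, not taken from any real Wikisource page.
ocr = "Tlie quick brown fox jumps ovcr the lazy dog"
proofread = "The quick brown fox jumps over the lazy dog"
for (wrong, right), count in correction_pairs(ocr, proofread).items():
    print(f"{wrong!r} -> {right!r} ({count})")
```

Accumulated over many proofread pages, such counts would show which
misreadings are systematic for a given typeface or scan quality.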
When running OCR as an engine/server/API, what do we do when it
misinterprets the columns on a page and reads lines straight across
them? Is there a way to manually indicate where the columns are, and
resubmit the page for new OCR?
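
One possible shape for such a resubmission, as a sketch only: the
proofreader marks the x positions of the column gaps, the page is cut
into one box per column, and each box is sent to OCR separately. The
names column_boxes, ocr_by_columns and ocr_region are hypothetical;
ocr_region stands in for whatever engine or API is used, and nothing
here is taken from an existing OCR service.

```python
def column_boxes(page_width, page_height, separators):
    """Turn manually marked column separators into crop boxes.

    separators: x positions (pixels) of the gaps between columns.
    Returns (left, top, right, bottom) boxes in left-to-right reading order.
    """
    for x in separators:
        if not 0 < x < page_width:
            raise ValueError(f"separator {x} lies outside the page")
    edges = [0] + sorted(set(separators)) + [page_width]
    return [(left, 0, right, page_height)
            for left, right in zip(edges, edges[1:])]


def ocr_by_columns(page_width, page_height, separators, ocr_region):
    """Run OCR column by column and join the results in reading order.

    ocr_region is an assumed callable (box -> text) supplied by the caller.
    """
    boxes = column_boxes(page_width, page_height, separators)
    return "\n".join(ocr_region(box) for box in boxes)


# A two-column page, 2000 x 3000 pixels, with the gap marked at x = 1000.
print(column_boxes(2000, 3000, [1000]))
```

This assumes simple full-height columns; pages with headlines or
illustrations spanning several columns would need rectangular regions
rather than just x positions.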
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se