On 11/28/2011 01:59 PM, Mathias Schindler wrote:
I recommend sticking with and supporting open source technology that has been made available by third parties, such as http://code.google.com/p/ocropus/ and http://code.google.com/p/tesseract-ocr/
Do you recommend this based on experience, or based on free software ideology? Apparently the Internet Archive tried and gave up, because FineReader was far better. Are there any good examples where free software has delivered good OCR quality?
Wikisource does provide feedback on quality: After OCR, when a page has been proofread, the OCR software could learn from the diff. But is there any OCR software that can take this kind of input?
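To make the "learn from the diff" idea concrete, here is a rough sketch of how correction pairs could be pulled out of an OCR page and its proofread version. It uses only Python's standard difflib. The function name extract_corrections and the sample strings are my own illustration, and whether any OCR engine can actually consume such pairs as training input is exactly the open question.

```python
import difflib


def extract_corrections(ocr_text, proofread_text):
    """Return (ocr_fragment, corrected_fragment) pairs for word-level replacements.

    Hypothetical helper for illustration; not part of any OCR package.
    """
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only "replace" spans are kept: these are the places where the
        # proofreader changed what the OCR produced.
        if tag == "replace":
            pairs.append((" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2])))
    return pairs


if __name__ == "__main__":
    # Made-up sample strings, not real Wikisource data.
    for wrong, right in extract_corrections("Tlie quick brovvn fox", "The quick brown fox"):
        print(wrong, "->", right)
```

Pairs like these, collected over many proofread pages, would at least show which misreadings are most frequent, even if the engine itself cannot be retrained on them.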
When running OCR as an engine/server/API, what do we do when it misinterprets the columns on a page and reads each line straight across the full page width? Is there a way to manually indicate where the columns are, and resubmit the page for new OCR?
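One way such a resubmission could work, sketched under assumptions: the user marks the x-positions of the column gutters, the page is cut into one region per column, and each region goes through OCR separately. The names column_regions and ocr_by_column are hypothetical, and so is the ocr_region callable, which stands in for whatever engine or API sits behind the service.

```python
def column_regions(page_width, page_height, boundaries):
    """Turn manually marked gutter x-positions into (left, top, right, bottom) boxes."""
    edges = [0] + sorted(boundaries) + [page_width]
    return [
        (left, 0, right, page_height)
        for left, right in zip(edges, edges[1:])
        if right > left  # skip empty regions from duplicate or edge boundaries
    ]


def ocr_by_column(page, regions, ocr_region):
    """Run ocr_region(page, box) on each column box in order and join the results.

    ocr_region is an assumed callable supplied by the caller; it is not a
    real library function.
    """
    return "\n\n".join(ocr_region(page, box) for box in regions)


if __name__ == "__main__":
    # A 1000x1500 pixel page with one gutter marked at x=500.
    boxes = column_regions(1000, 1500, [500])
    fake_ocr = lambda page, box: "text from x=%d..%d" % (box[0], box[2])
    print(ocr_by_column(None, boxes, fake_ocr))
```

This only handles full-height vertical columns. Pages with headlines spanning several columns, or with columns of unequal height, would need the user to mark rectangles rather than just x-positions.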