On 11/28/2011 01:59 PM, Mathias Schindler wrote:
I recommend sticking with and supporting open source technology that has been made available by third parties, such as http://code.google.com/p/ocropus/ / http://code.google.com/p/tesseract-ocr/
Do you recommend this based on experience, or based on free software ideology? Apparently the Internet Archive tried and gave up, because FineReader was far better. Are there any good examples where free software has been used and produced good OCR quality?
Wikisource does provide feedback on quality: After OCR, when a page has been proofread, the OCR software could learn from the diff. But is there any OCR software that can take this kind of input?
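To make the "learn from the diff" idea concrete, here is a minimal sketch of how the corrections could be extracted as (OCR output, proofread text) pairs using Python's standard difflib module. The function name correction_pairs is hypothetical, and whether any OCR engine can actually consume such pairs remains the open question above.

```python
import difflib


def correction_pairs(ocr_text, proofread_text):
    """Return (ocr_fragment, corrected_fragment) pairs for each word-level change."""
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions and deletions show up with an empty string on one side.
            pairs.append((" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2])))
    return pairs


if __name__ == "__main__":
    for wrong, right in correction_pairs("Tbe quick brown f0x jumps",
                                         "The quick brown fox jumps"):
        print(repr(wrong), "->", repr(right))
```

Aggregated over many proofread pages, such pairs would show which confusions (for example "0" for "o") are most frequent in a given scan.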
When running OCR as an engine/server/API, what do we do when it misinterprets columns in a page, and reads long lines across the page? Is there a way to manually indicate where columns are, and resubmit the page for new OCR?
I'm going to upgrade my licensed FineReader 10 to FineReader 11 (so that it.source too will have a volunteer with a legally licensed FineReader 11... :-) ). I downloaded the trial version, and I can confirm that it produces a complete djvu file (images and text layer) in a single step.
The text layer doesn't have the full range of detail: it's organized into only two levels (page and line), while the OCR engine on the IA servers produces a very rich "tree" (page, column, region, paragraph, line and word). The images can't be finely tuned, but it is possible, given better-quality images of the same width and height, to "transplant" the text layer into a different djvu file with a few DjVuLibre commands.
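The message doesn't say which DjVuLibre commands were used for the "transplant". One plausible pair, sketched here in Python, uses djvused: its output-txt command dumps the hidden text as a script, and its -f and -s options replay that script on another file and save it. The function names and the default script filename are made up for illustration, and both files are assumed to have the same page count and page sizes.

```python
import subprocess


def transplant_commands(source_djvu, target_djvu, script_path="text_layer.dsed"):
    """Build the two djvused command lines (assumed workflow, not confirmed by the author)."""
    # Step 1: print a djvused script that recreates the hidden text of every page.
    extract = ["djvused", "-e", "output-txt", source_djvu]
    # Step 2: run that script against the target file and save the result (-s).
    apply_script = ["djvused", "-f", script_path, "-s", target_djvu]
    return extract, apply_script


def transplant_text_layer(source_djvu, target_djvu, script_path="text_layer.dsed"):
    """Copy the text layer of source_djvu into target_djvu (modifies target_djvu in place)."""
    extract, apply_script = transplant_commands(source_djvu, target_djvu, script_path)
    with open(script_path, "w", encoding="utf-8") as script_file:
        subprocess.run(extract, stdout=script_file, check=True)
    subprocess.run(apply_script, check=True)
```

Since the target file is overwritten, it is safer to try this on a copy first.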
Is any of you interested in a rather deep exploration of the djvu text layer with Python? I'm working on it, but I feel that there's so much to do, and so much to gain. I'm currently working in a Dropbox folder on Windows, which also contains the DjVuLibre routines.
Alex_brollo
On 11/28/2011 10:23 PM, Alex Brollo wrote:
[...] FineReader 11 [...] produces a complete djvu file [...] Text layer hasn't full range of details, it's organized into two levels (page and line), while OCR engine on IA servers produces a very rich "tree" (page, column, region, paragraph, line and word).
Has anybody designed a web interface that shows the scanned image and the zones or regions of the Djvu text layer? It would look similar to image annotation on Commons, http://commons.wikimedia.org/wiki/Commons:Image_annotations
For a Djvu file uploaded to Commons, could you automatically generate image annotations for the various text columns and illustrations? Does image annotation handle multi-page document formats such as PDF and Djvu?
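One detail any such generator would have to handle: as far as the djvused documentation describes it, text zone coordinates are measured in pixels from the bottom-left corner of the page, while web image overlays usually count from the top-left. A minimal sketch of the conversion follows; the function name and the returned dictionary layout are hypothetical and do not correspond to any existing Commons annotation format.

```python
def zone_to_top_left_box(xmin, ymin, xmax, ymax, page_height):
    """Convert a DjVu text zone (origin bottom-left) to a top-left based box."""
    return {
        "left": xmin,
        "top": page_height - ymax,  # flip the vertical axis
        "width": xmax - xmin,
        "height": ymax - ymin,
    }


if __name__ == "__main__":
    # A line zone near the top of a page that is 3300 pixels high.
    print(zone_to_top_left_box(100, 3000, 900, 3060, page_height=3300))
```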
(Shouldn't image annotations and timed text be the same thing?)
2011/11/29 Lars Aronsson lars@aronsson.se
On 11/28/2011 10:23 PM, Alex Brollo wrote:
[...] FineReader 11 [...] produces a complete djvu file [...] Text layer hasn't full range of details, it's organized into two levels (page and line), while OCR engine on IA servers produces a very rich "tree" (page, column, region, paragraph, line and word).
Has anybody designed a web interface that shows the scanned image and the zones or regions of the Djvu text layer? It would look similar to image annotation on Commons, http://commons.wikimedia.org/wiki/Commons:Image_annotations
For a Djvu file uploaded to Commons, could you automatically generate image annotations for the various text columns and illustrations? Does image annotation handle multi-page document formats such as PDF and Djvu?
Thanks for the interesting questions. I'm exploring as deeply as I can the djvu text layer, the metadata, and the other information wrapped into a djvu file, and my feeling is that djvu support is very primitive; the first step needed is perhaps conversion from "bundled" to "indirect" format. Djvu files on the web are great precisely because single pages can be shared on the web with their complete content.
I'll take a look at image annotations. I don't know anything about them, although I have tested the ImageMap extension as a proofreading tool; take a look here: http://it.wikisource.org/wiki/Pagina:Vettura_a_vapore_del_signor_Dietz.djvu/...
At present I'm building a Python DjvuDsed "object" containing all the information about the whole text layer, the annotations and the other information in a djvu file, and I'm adding methods and attributes to this formidable object one by one. I'll keep your ideas in mind as I go on.
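As an illustration of what the core of such an object might look like (this is not the actual DjvuDsed code, which isn't shown here), the sketch below parses the parenthesized text that djvused prints for a page's hidden text into a small tree. The Zone class, its fields and the sample page are all assumptions; only \" and \\ escapes are decoded, and octal escapes are left as they are.

```python
import re
from dataclasses import dataclass, field

# Tokens: parentheses, double-quoted strings (with backslash escapes), bare atoms.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


@dataclass
class Zone:
    kind: str    # page, column, region, para, line, word, ...
    bbox: tuple  # (xmin, ymin, xmax, ymax) in page pixels
    text: str = ""
    children: list = field(default_factory=list)

    def words(self):
        """Yield the text of every leaf zone in document order."""
        if not self.children:
            yield self.text
        for child in self.children:
            yield from child.words()


def _unquote(token):
    # Strip the surrounding quotes and decode \" and \\ only.
    return re.sub(r'\\(["\\])', r"\1", token[1:-1])


def _parse(tokens, pos):
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    kind = tokens[pos + 1]
    bbox = tuple(int(t) for t in tokens[pos + 2:pos + 6])
    zone = Zone(kind, bbox)
    pos += 6
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = _parse(tokens, pos)
            zone.children.append(child)
        else:
            zone.text += _unquote(tokens[pos])
            pos += 1
    return zone, pos + 1


def parse_zone(source):
    """Parse one '(page ...)' expression into a Zone tree."""
    zone, _ = _parse(TOKEN.findall(source), 0)
    return zone


if __name__ == "__main__":
    sample = ('(page 0 0 2550 3300 (line 100 3000 900 3060 '
              '(word 100 3000 300 3060 "Hello") (word 320 3000 900 3060 "world")))')
    page = parse_zone(sample)
    print(page.kind, page.bbox, list(page.words()))
```

The same tree shape covers both the two-level FineReader output (page and line) and the deeper IA output, since each zone simply holds either text or child zones.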
Alex
wikisource-l@lists.wikimedia.org