On Jan 21, 2008 11:59 AM, ThomasV thomasV1@gmx.de wrote:
Nice idea. Note that we now have an OCR server running Tesseract. It is linked to Proofreadpage (and it works erratically).
I've found Tesseract alone to be fairly erratic on real documents. OCRopus makes it behave much better.
Questions: are the bbox coordinates generated by the OCR engine?
Yep.
In that case, what happens if the OCR outputs an incorrect number of lines?
You could manually correct the coords, or simply add your text to the nearest line, which would be incorrect but better than no markup at all.
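To make "nearest line" concrete, here is a rough sketch of one way to pick it. It assumes each line's bbox is an (x0, y0, x1, y1) tuple in page pixels and that you know roughly where on the scan the missed text sits. The function name and the "closest vertical centre" rule are illustrative choices, not something the OCR engine provides.

```python
def nearest_line(bboxes, y):
    """Return the index of the bbox whose vertical centre is closest to y.

    bboxes: list of (x0, y0, x1, y1) tuples, one per OCR line (assumed layout).
    y: vertical position on the page of the text that has no line of its own.
    """
    if not bboxes:
        raise ValueError("no OCR lines on this page")
    return min(
        range(len(bboxes)),
        key=lambda i: abs((bboxes[i][1] + bboxes[i][3]) / 2 - y),
    )


# Three OCR lines; the missed text sits at about y=70 on the scan.
lines = [(10, 10, 200, 30), (10, 40, 200, 60), (10, 100, 200, 120)]
target = nearest_line(lines, 70)
```

The extra text would then go into the span for `lines[target]`, which is wrong in position but keeps it attached to something on the page.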
Also, I think you do need a JavaScript hack for the edit box; what happens if the user creates a new line?
The user can do whatever he wants... if the results don't match reality the djvus will act a bit weird. I could easily enough make a bot that will scan documents for document body text outside of line-spans and tag the pages for OCR markup improvements.
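For what it's worth, the check such a bot would run could look roughly like the sketch below. It assumes the line markup is hOCR-style, i.e. each line wrapped as `<span class="ocr_line" title="bbox x0 y0 x1 y1">`, which may not be the markup we end up using; the class and function names are made up for illustration. It only finds the stray text; fetching pages and tagging them is left out.

```python
from html.parser import HTMLParser


class StrayTextFinder(HTMLParser):
    """Collect text that sits outside any <span class="ocr_line"> element."""

    def __init__(self):
        super().__init__()
        self.span_stack = []  # one bool per open <span>: is it an ocr_line?
        self.stray = []       # text fragments found outside every line span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            self.span_stack.append("ocr_line" in classes)

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.span_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Text counts as stray only if no enclosing span is a line span.
        if text and not any(self.span_stack):
            self.stray.append(text)


def find_stray_text(page_html):
    """Return the list of text fragments that are not inside a line span."""
    finder = StrayTextFinder()
    finder.feed(page_html)
    finder.close()
    return finder.stray


page = (
    '<span class="ocr_line" title="bbox 10 10 200 30">First line</span>\n'
    "A new line typed by an editor\n"
    '<span class="ocr_line" title="bbox 10 40 200 60">Second line</span>'
)
needs_review = bool(find_stray_text(page))
```

A page for which `find_stray_text` returns anything would get tagged for OCR markup improvement.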
With the current OCRopus code I can't find any totally missed lines in these documents. I'm sure they will happen, but I wouldn't want to do the imports unless they were rare enough that the inconvenience of dealing with them isn't a deal breaker.