On Jan 21, 2008 11:59 AM, ThomasV <thomasV1(a)gmx.de> wrote:
Nice idea. Note that we now have an OCR server running Tesseract.
It is linked to Proofreadpage (and it works erratically).
I've found Tesseract alone to be fairly erratic for real documents.
Ocropus makes it behave much better.
Questions: are the bbox coordinates generated by the OCR engine?
Yep.
In that case, what happens if the OCR outputs an incorrect number
of lines?
You could manually correct the coords, or simply add your text to the
nearest line, which would be incorrect but better than no markup at
all.
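A rough sketch of the "nearest line" fallback, in Python. It assumes
each line's bbox is an (x0, y0, x1, y1) tuple and that we know the
vertical position y of the text being added; the function name and
the tuple layout are illustrative assumptions, not something the OCR
engine or Proofreadpage defines.

```python
def nearest_line(bboxes, y):
    """Return the index of the line box whose vertical centre is closest to y.

    bboxes: list of (x0, y0, x1, y1) tuples (assumed layout).
    y:      vertical position of the text that has no line of its own.
    """
    if not bboxes:
        raise ValueError("no line boxes to choose from")

    def distance(i):
        _, y0, _, y1 = bboxes[i]
        return abs((y0 + y1) / 2 - y)

    # On a tie, min() keeps the earliest line.
    return min(range(len(bboxes)), key=distance)


# Hypothetical boxes for three lines on a page.
lines = [(10, 10, 200, 30), (10, 40, 200, 60), (10, 70, 200, 90)]
index = nearest_line(lines, 52)
```

Text attached this way ends up at the wrong coordinates, but it is
still inside some line's markup.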
Also, I think you do need a JavaScript hack for the edit box;
what happens if the user creates a new line?
The user can do whatever he wants. If the results don't match
reality, the DjVu files will act a bit weird. I could easily enough
make a bot that scans documents for body text outside of line-spans
and tags the pages for OCR markup improvements.
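As a minimal sketch of what such a bot's check might look like (not
an existing bot), here is a Python version using only the standard
library. It assumes each line is wrapped in a <span> whose class
includes "ocr_line", as in hOCR-style output; the class name and the
function names are assumptions for illustration.

```python
from html.parser import HTMLParser


class StrayTextFinder(HTMLParser):
    """Collects non-whitespace text that is not inside any line span."""

    def __init__(self, line_class="ocr_line"):  # class name is an assumption
        super().__init__()
        self.line_class = line_class
        self.open_spans = []  # one bool per open <span>: is it a line span?
        self.stray = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            self.open_spans.append(self.line_class in classes)

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()

    def handle_data(self, data):
        text = data.strip()
        # Text counts as stray when no enclosing span is a line span.
        if text and not any(self.open_spans):
            self.stray.append(text)


def find_stray_text(page_html):
    """Return the list of text fragments found outside line spans."""
    finder = StrayTextFinder()
    finder.feed(page_html)
    finder.close()
    return finder.stray


page = (
    '<span class="ocr_line" title="bbox 10 10 200 30">First line</span>\n'
    "text typed outside any line\n"
    '<span class="ocr_line" title="bbox 10 40 200 60">Second line</span>'
)
needs_markup_fix = bool(find_stray_text(page))
```

A page for which find_stray_text() returns anything would be tagged
for markup cleanup; the tagging itself is left out here.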
With the current ocropus code on these documents I'm unable to find
any totally missed lines. While I'm sure they will happen, I wouldn't
want to do the imports unless they were rare enough that the
inconvenience of dealing with them isn't a deal breaker.