Re: [Wikisource-l] On linking Wikisource with page images

22 Jan 2008


On Jan 21, 2008 11:59 AM, ThomasV thomasV1@gmx.de wrote:
...
Nice idea. Note that we now have an OCR server running Tesseract.
It is linked to Proofreadpage (and it works erratically).
I've found Tesseract alone to be fairly erratic for real documents.
Ocropus makes it behave much better.
...
Question: are the bbox coordinates generated by the OCR engine?
Yep.
...
In that case, what happens if the OCR outputs an incorrect number
of lines?
You could manually correct the coords, or simply add your text to the
nearest line, which would be incorrect but better than no markup at
all.
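
To make the "nearest line" fallback concrete, here is a minimal sketch. It
assumes each OCR line carries a bounding box as (x0, y0, x1, y1) in pixel
coordinates; the function name nearest_line and the sample numbers are
hypothetical and only illustrate the idea, not what any existing tool does.

```python
from typing import List, Tuple

# Assumed bbox layout: (x0, y0, x1, y1) in pixels, y growing downwards.
BBox = Tuple[int, int, int, int]


def nearest_line(bboxes: List[BBox], y: float) -> int:
    """Return the index of the line whose vertical centre is closest to y."""
    if not bboxes:
        raise ValueError("no OCR lines to attach the text to")
    return min(
        range(len(bboxes)),
        key=lambda i: abs((bboxes[i][1] + bboxes[i][3]) / 2 - y),
    )


# Three hypothetical OCR lines; the missed text sits at about y = 152.
lines = [(10, 100, 500, 120), (10, 130, 500, 150), (10, 160, 500, 180)]
target = nearest_line(lines, 152)
```

The text would then be appended to the line at index target. The position
is only approximate, which matches the "incorrect but better than no
markup" trade-off.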
...
Also, I think you do need a JavaScript hack for the edit box;
what happens if the user creates a new line?
The user can do whatever he wants... if the results don't match
reality, the DjVus will act a bit weird. I could easily enough make a
bot that will scan documents for document body text outside of
line-spans and tag the pages for OCR markup improvements.
With the current ocropus code on these documents I'm unable to find
any totally missed lines. While I'm sure they will happen, I wouldn't
want to do the imports unless they were rare enough that the
inconvenience of dealing with them isn't a deal breaker.
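
The check such a bot would run could look roughly like the sketch below.
The exact page markup is not shown in this thread, so the span class
"ocr_line" and the title="bbox ..." attribute are assumptions borrowed
from hOCR-style output, and find_stray_text is a hypothetical helper name.

```python
from html.parser import HTMLParser


class StrayTextFinder(HTMLParser):
    """Collect text that is not enclosed in any line-span."""

    def __init__(self, line_class="ocr_line"):  # class name is an assumption
        super().__init__()
        self.line_class = line_class
        self.stack = []   # one bool per open <span>: is it a line-span?
        self.stray = []   # text fragments found outside all line-spans

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            self.stack.append(self.line_class in classes)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and not any(self.stack):
            self.stray.append(text)


def find_stray_text(page_markup):
    """Return the text fragments sitting outside line-spans."""
    finder = StrayTextFinder()
    finder.feed(page_markup)
    finder.close()
    return finder.stray


page = (
    '<span class="ocr_line" title="bbox 10 100 500 120">First line</span>\n'
    "An added line\n"
    '<span class="ocr_line" title="bbox 10 130 500 150">Second <i>line</i></span>'
)
stray = find_stray_text(page)
```

A page with a non-empty result would be the one to tag for OCR markup
improvement; how the tagging itself is done is left out here.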
