Nice idea. Note that we now have an OCR server running Tesseract.
It is linked to Proofreadpage (and it works erratically).
Questions: are the bbox coordinates generated by the OCR engine?
In that case, what happens if the OCR outputs an incorrect number
of lines?
Also, I think you do need a JavaScript hack for the edit box:
what happens if the user creates a new line?
Thomas
Gregory Maxwell wrote:
What I'd like to do is
leave invisible markup generated by the ocr software in the page text,
like this:
<span class='ocr_line' title='bbox 551 4202 2666 4278 1'>The first
experiments were made on the absorption of carbonic</span> <span
class='ocr_line' title='bbox 474 4281 2668 4355 1'>acid gas by water:
and here a singular disagreement was observed</span> <span
class='ocr_line' title='bbox 471 4360 2668 4433 1'>in the first trials
made under exactly the same circumstances. It</span>
From this the ocred text could be corrected, and markup could be
added, but I could still take the output and apply it back to the
original document. If people feel this would frustrate editing too
much we could make some Javascript hacks to the edit box to reduce the
span tags to nothing more than an immutable <S marker.
Would this be acceptable?
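
To make the round trip concrete, here is a rough Python sketch of how
the spans in the example could be read back out of corrected page text.
It is only an illustration: the function name extract_lines is made up,
the pattern assumes the attributes appear exactly as in the example
(single quotes, class before title), and the trailing fifth number in
the bbox title is simply ignored because its meaning is not stated here.

```python
import re

# Matches one line span in the form shown in the example markup.
# Assumption: class comes before title and both use single quotes.
LINE_RE = re.compile(
    r"<span\s+class='ocr_line'\s+"
    r"title='bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+\d+)?'>"
    r"(.*?)</span>",
    re.DOTALL,
)


def extract_lines(page_text):
    """Return a list of ((x0, y0, x1, y1), text) pairs, one per ocr_line span."""
    lines = []
    for match in LINE_RE.finditer(page_text):
        # The first four numbers are taken as the box corners.
        bbox = tuple(int(match.group(i)) for i in range(1, 5))
        # Collapse hard line wraps inside the span into single spaces.
        text = " ".join(match.group(5).split())
        lines.append((bbox, text))
    return lines
```

Comparing the number of pairs returned with the number of lines the OCR
engine originally reported would be one cheap way to notice that an
editor has added or removed a line.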