On Wednesday 11 January 2012 18:19:14 Cristian Consonni wrote:
I believe the trickiest part is creating a system to put results back
in Wikisource in a semi-automated way, but having "captcha reviewers"
may help.
OCR engines generally work by finding lines of text on a page, splitting the
lines into letters, then recognizing each letter separately. So an OCR engine
knows, for each letter of the recognized text, its bounding box on the page.
However, to my knowledge, no OCR engine exports this data, nor is there a
standard format for it. If an open source OCR engine could be modified to do
this, then it would be easy to inject data retrieved from captchas back into
the OCR-ed text. And it could be used for so much more :)
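
To make the idea a bit more concrete, here is a rough Python sketch of how it might work. Everything in it is an assumption made up for illustration: the Glyph record, the (left, top, right, bottom) pixel box layout, and the inject_correction helper are hypothetical and not the output format of any existing OCR. It also assumes the captcha shows a single word on one line, so the matching letters are contiguous in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom) in page pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Glyph:
    """One recognized letter plus where it sits on the page (assumed format)."""
    char: str
    box: Box


def union_box(glyphs: List[Glyph]) -> Box:
    """Smallest box that contains every glyph in the list."""
    return (
        min(g.box[0] for g in glyphs),
        min(g.box[1] for g in glyphs),
        max(g.box[2] for g in glyphs),
        max(g.box[3] for g in glyphs),
    )


def glyphs_in_region(glyphs: List[Glyph], region: Box) -> List[int]:
    """Indices of glyphs whose box lies entirely inside the region."""
    left, top, right, bottom = region
    return [
        i for i, g in enumerate(glyphs)
        if g.box[0] >= left and g.box[1] >= top
        and g.box[2] <= right and g.box[3] <= bottom
    ]


def inject_correction(glyphs: List[Glyph], region: Box, corrected: str) -> List[Glyph]:
    """Replace the letters inside the captcha region with the reviewed text.

    Assumes the matching glyphs are contiguous (one word on one line).
    The new letters share the old word's box, split into equal widths,
    which is only an approximation of their real positions.
    """
    idx = glyphs_in_region(glyphs, region)
    if not idx:
        return list(glyphs)  # nothing on the page matches this region
    start, end = idx[0], idx[-1] + 1
    if not corrected:
        return glyphs[:start] + glyphs[end:]  # reviewer says there is no text here
    left, top, right, bottom = union_box(glyphs[start:end])
    width = (right - left) / len(corrected)
    replacement = [
        Glyph(ch, (round(left + i * width), top, round(left + (i + 1) * width), bottom))
        for i, ch in enumerate(corrected)
    ]
    return glyphs[:start] + replacement + glyphs[end:]


# The OCR misread "the" as "tbe"; a captcha reviewer was shown that region.
page = [Glyph(c, (10 * i, 0, 10 * i + 10, 20)) for i, c in enumerate("tbe cat")]
fixed = inject_correction(page, (0, 0, 30, 20), "the")
print("".join(g.char for g in fixed))
```

The point is that the captcha only needs to remember which region of the page it showed; with per-letter boxes available, mapping the answer back to a position in the text is a simple lookup.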