On Wednesday 11 January 2012 18:19:14 Cristian Consonni wrote:
I believe the trickiest part is creating a system to put results back in Wikisource in a semi-automated way, but having "captcha reviewers" may help.
OCR engines generally work by finding lines of text on a page, splitting the lines into letters, and then recognizing each letter separately. So an OCR engine knows, for each letter of the recognized text, its bounding box on the page.
However, to my knowledge, no OCR engine exports this data, nor is there a standard format for it. If an open source OCR engine could be modified to do this, it would be easy to inject data retrieved from captchas back into the OCR-ed text. And it could be used for so much more :)
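
To make the idea a bit more concrete, here is a rough sketch in Python of what such exported data might look like and how a captcha answer could be injected back. Everything in it is an assumption for illustration only: the `Glyph` record, the (left, top, right, bottom) pixel box layout, and the `inject_correction` helper are hypothetical names, not the output format of any existing OCR engine. It also assumes the glyphs are stored in reading order, so the ones covered by a captcha region form one contiguous run.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom) in page pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Glyph:
    """One recognized piece of text and where it sits on the page (hypothetical)."""
    text: str
    box: Box


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def page_text(glyphs: List[Glyph]) -> str:
    """Rebuild the plain text from glyphs stored in reading order."""
    return "".join(g.text for g in glyphs)


def inject_correction(glyphs: List[Glyph], region: Box, corrected: str) -> List[Glyph]:
    """Replace the glyphs covered by a captcha region with the reviewed text.

    Assumes the covered glyphs are contiguous in reading order. If the
    region covers nothing, the original list is returned as a copy.
    """
    hits = [i for i, g in enumerate(glyphs) if overlaps(g.box, region)]
    if not hits:
        return list(glyphs)
    first, last = hits[0], hits[-1]
    # The corrected word becomes a single entry spanning the whole region.
    return glyphs[:first] + [Glyph(corrected, region)] + glyphs[last + 1:]


if __name__ == "__main__":
    # The OCR misread "clock" as "c1ock"; each letter is 10 pixels wide.
    ocr = [Glyph(ch, (i * 10, 0, i * 10 + 10, 20)) for i, ch in enumerate("the c1ock")]
    fixed = inject_correction(ocr, (40, 0, 90, 20), "clock")
    print(page_text(fixed))
```

The point of keeping the box next to each letter is that a captcha image is just a cropped region of the page, so the region alone is enough to find which part of the text the reviewer's answer belongs to.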