Gregory Maxwell wrote:
I'd really like it if the corrected text in wikisource could be imported back into the djvu document images.
Some thoughts:
1. The easy way to do OCR is not to do OCR. If you download books scanned by the Internet Archive / Open Content Alliance, they are already OCRed. Both images and raw OCR text are contained in the djvu files. I think IA uses OCR technology from H-P that isn't open sourced.
2. It is nice to have pixel coordinates for each word or line of text, but this requires that the image is kept unchanged. If the scanned image is uploaded to Wikimedia Commons, some helpful user might touch it up, deskew it, improve the contrast and upload a new version, after which all pixel coordinates might be ruined.
3. As you mentioned, there are now some open sourced OCR engines. I haven't tried them, but I assume they will improve and become useful. The traditional use for OCR is to read an image and output raw text, and proofreading has traditionally been a one-person process with very limited feedback. When collaborative proofreading (as in PGDP.net or Wikisource) is combined with open sourced OCR software, we have a new potential feedback loop. Instead of finding the words in an image, we would need a routine that takes a scanned image and an already proofread text, and tries to find the pixel coordinates of those words. If that sort of software existed, we wouldn't need to preserve coordinates during proofreading, because we could reconstruct them afterwards. This might be a suitable summer-of-code project for the right person, someone who is already familiar with the OCR software.
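
To make the idea in point 3 a bit more concrete, here is a rough Python sketch of one simple way such a routine might work. It does not look at the image at all: it assumes an OCR engine has already produced a list of (word, bounding box) pairs, and it only aligns that noisy word sequence with the proofread text using the standard library's difflib, copying boxes across where the two line up. The function name align_words and the data layout are made up for illustration, and a real tool would likely need something smarter for merged or split words.

```python
from difflib import SequenceMatcher


def align_words(ocr_words, proofread_text):
    """Attach OCR bounding boxes to the words of a proofread text.

    ocr_words is a hypothetical list of (word, (x0, y0, x1, y1)) pairs
    as some OCR engine might report them. Returns a list of
    (proofread_word, box_or_None) in proofread order.
    """
    ocr_tokens = [word.lower() for word, _ in ocr_words]
    good_words = proofread_text.split()
    good_tokens = [word.lower() for word in good_words]

    # Align the two word sequences; autojunk is disabled so that
    # frequent words are not ignored on long pages.
    matcher = SequenceMatcher(a=ocr_tokens, b=good_tokens, autojunk=False)

    # Start with no coordinates, then fill in what the alignment gives us.
    result = [(word, None) for word in good_words]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # "equal" runs match exactly. A "replace" run of the same length
        # is assumed to be word-for-word OCR misreadings (e.g. "f0x").
        if tag == "equal" or (tag == "replace" and same_length):
            for k in range(j2 - j1):
                result[j1 + k] = (good_words[j1 + k], ocr_words[i1 + k][1])
    return result


if __name__ == "__main__":
    ocr = [
        ("Tbe", (10, 10, 40, 30)),
        ("quick", (50, 10, 100, 30)),
        ("brown", (110, 10, 160, 30)),
        ("f0x", (170, 10, 200, 30)),
    ]
    for word, box in align_words(ocr, "The quick brown fox"):
        print(word, box)
```

Words that the proofreaders added and that have no OCR counterpart come back with None as their box, so a caller can tell which coordinates were actually recovered and which are still missing.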