[Wikisource-l] On linking Wikisource with page images

Sat Jan 26 22:55:22 UTC 2008

Gregory Maxwell wrote:

> I'd really like it if the corrected text in wikisource could be 
> imported back into the djvu document images.

Some thoughts:

1. The easy way to do OCR is not to do OCR.  If you download books 
scanned by the Internet Archive / Open Content Alliance, they are 
already OCRed.  Both images and raw OCR text are contained in the 
djvu files.  I think IA uses OCR technology from H-P that isn't 
open sourced.

2. It is nice to have pixel coordinates for each word or line of 
text, but this requires that the image is kept unchanged.  If the 
scanned image is uploaded to Wikimedia Commons, some helpful user 
might touch it up, deskew it, improve the contrast and upload a 
new version, after which all pixel coordinates might be ruined.

3. As you mentioned, there are now some open sourced OCR engines.  
I haven't tried them, but I assume they will improve and become 
useful.  The traditional use for OCR is to read an image and 
output raw text, and proofreading has traditionally been a 
one-person process with very limited feedback.  When collaborative 
proofreading (as in PGDP.net or Wikisource) is combined with open 
sourced OCR software, we have a new potential feedback loop.  
Instead of finding the words in an image, we would need a routine 
that takes a scanned image and an already proofread text, and 
tries to find the pixel coordinates for these words.  If that sort 
of software existed, we wouldn't need to preserve coordinates 
during proofreading, because we could reconstruct them afterwards.  
This might be a suitable summer-of-code project for the right 
person, who is already familiar with the OCR software.
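
To make the idea in point 3 a bit more concrete, here is a rough 
sketch in Python.  It is not the routine described above, only a 
simpler stand-in: it assumes the OCR engine has already produced 
word boxes for the (possibly retouched) image, and it aligns the 
raw OCR words with the proofread words so that the corrected text 
inherits the coordinates.  The function name transfer_boxes, the 
Box tuple layout and the sample data are my own inventions for 
illustration, not part of any existing OCR package.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def union(boxes: List[Box]) -> Box:
    """Smallest box that covers all the given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def transfer_boxes(
    ocr_words: List[Tuple[str, Box]],
    proofread_words: List[str],
) -> List[Tuple[str, Optional[Box]]]:
    """Give each proofread word the box of the OCR word it aligns with.

    Words the proofreader added, with no OCR counterpart, get None.
    """
    ocr_tokens = [word for word, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_tokens, b=proofread_words, autojunk=False)
    result: List[Tuple[str, Optional[Box]]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            # Same number of words on both sides: assume a one-to-one
            # correspondence (e.g. "Tbe" corrected to "The").
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result.append((proofread_words[j], ocr_words[i][1]))
        elif tag == "replace":
            # Word counts differ (split or merged words): fall back to
            # the box covering the whole replaced OCR span.
            span = union([box for _, box in ocr_words[i1:i2]])
            for j in range(j1, j2):
                result.append((proofread_words[j], span))
        elif tag == "insert":
            # Text added during proofreading; no coordinates known.
            for j in range(j1, j2):
                result.append((proofread_words[j], None))
        # "delete": OCR noise removed by the proofreader, nothing to emit.
    return result


# Made-up sample data, only to show the call.
ocr = [
    ("Tbe", (10, 10, 40, 30)),
    ("quick", (50, 10, 100, 30)),
    ("brovvn", (110, 10, 170, 30)),
    ("fox", (180, 10, 210, 30)),
]
proofread = ["The", "quick", "brown", "fox"]
aligned = transfer_boxes(ocr, proofread)
```

A real solution would probably have to work on the pixels directly 
and cope with hyphenation, footnotes and reflowed paragraphs, which 
this word-level alignment ignores.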

-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se