Re: [Wikisource-l] On linking Wikisource with page images

27 Jan 2008


Gregory Maxwell wrote:
> ...
> I'd really like it if the corrected text in wikisource could be 
> imported back into the djvu document images.
Some thoughts:
1. The easy way to do OCR is not to do OCR.  If you download books 
scanned by the Internet Archive / Open Content Alliance, they are 
already OCRed.  Both images and raw OCR text are contained in the 
djvu files.  I think IA uses OCR technology from H-P that isn't 
open sourced.
2. It is nice to have pixel coordinates for each word or line of 
text, but this requires that the image is kept unchanged.  If the 
scanned image is uploaded to Wikimedia Commons, some helpful user 
might touch it up, deskew it, improve the contrast and upload a 
new version, after which all pixel coordinates might be ruined.
3. As you mentioned, there are now some open sourced OCR engines.  
I haven't tried them, but I assume they will improve and become 
useful.  Traditionally, OCR just reads an image and outputs raw 
text, and proofreading has been a one-person process with very 
limited feedback.  When collaborative 
proofreading (as in PGDP.net or Wikisource) is combined with open 
sourced OCR software, we have a new potential feedback loop.  
Instead of finding the words in an image, we would need a routine 
that takes a scanned image and an already proofread text, and 
tries to find the pixel coordinates for these words.  If that sort 
of software existed, we wouldn't need to preserve coordinates 
during proofreading, because we could reconstruct them afterwards.  
This might be a suitable summer-of-code project for the right 
person, who is already familiar with the OCR software.
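
The routine described in point 3 can be approximated without touching 
the image at all, if the raw OCR output still carries a bounding box 
per word (as the text layer in the Internet Archive djvu files does).  
What follows is only a minimal sketch under that assumption, not an 
existing tool: the function name transfer_boxes and the input format, 
a list of (word, box) pairs, are hypothetical.  It aligns the raw OCR 
words with the proofread words using Python's standard difflib and 
copies each box across wherever the two sequences line up.

```python
from difflib import SequenceMatcher


def transfer_boxes(ocr_words, proofread_words):
    """Carry pixel boxes from raw OCR words over to proofread words.

    ocr_words:       list of (text, (x0, y0, x1, y1)) from the raw OCR layer
                     (hypothetical format, chosen for this sketch)
    proofread_words: list of str, the corrected text in reading order

    Returns a list of (word, box_or_None), one entry per proofread word.
    """
    ocr_tokens = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_tokens, b=proofread_words, autojunk=False)

    # Start with no coordinates; fill in what the alignment can justify.
    result = [(word, None) for word in proofread_words]

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # "equal": the OCR word was already right.
        # "replace" of equal length: assume a one-to-one correction
        # (e.g. "hrown" -> "brown") and reuse the box of the old word.
        if tag == "equal" or (tag == "replace" and same_length):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result[j] = (proofread_words[j], ocr_words[i][1])
        # Insertions, deletions and uneven replacements stay at None;
        # those words would need a real image search to locate.

    return result


if __name__ == "__main__":
    raw = [
        ("Tbe", (10, 10, 40, 30)),
        ("quick", (50, 10, 100, 30)),
        ("hrown", (110, 10, 170, 30)),
        ("fox", (180, 10, 210, 30)),
    ]
    for word, box in transfer_boxes(raw, ["The", "quick", "brown", "fox"]):
        print(word, box)
```

Words that a proofreader inserted, merged or split come back without 
a box, so this only covers the easy cases.  It also does not help 
with point 2: if the image itself has been deskewed or cropped, the 
old boxes are wrong no matter how well the words align, and locating 
the words again would take the OCR engine itself.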
-- 
  Lars Aronsson (lars@aronsson.se)
  Aronsson Datateknik - http://aronsson.se
