On Jan 27, 2008 9:55 AM, Lars Aronsson <lars(a)aronsson.se> wrote:
Gregory Maxwell wrote:
I'd really like it if the corrected text in Wikisource could be imported back into the djvu document images.
Some thoughts:
1. The easy way to do OCR is not to do OCR. If you download books
scanned by the Internet Archive / Open Content Alliance, they are
already OCRed. Both images and raw OCR text are contained in the
djvu files. I think IA uses OCR technology from H-P that isn't
open sourced.
The software that Gregory Maxwell is planning to use is the same that
is used for Google Books. It is open source.
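For anyone who wants to poke at those IA downloads, the DjVuLibre
command-line tools can expose the embedded text layer. A rough sketch
(the filename book.djvu is just a placeholder):

```shell
# Dump the plain OCR text from the hidden text layer
djvutxt book.djvu > book.txt

# Dump the text layer with per-word pixel coordinates, as s-expressions
# of the form (word xmin ymin xmax ymax "text")
djvused book.djvu -e 'print-txt' > book-coords.txt
```

The second form is the one that matters for point 2 below, since it is
where the per-word coordinates live.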
2. It is nice to have pixel coordinates for each word or line of
text, but this requires that the image is kept unchanged. If the
scanned image is uploaded to Wikimedia Commons, some helpful user
might touch it up, deskew it, improve the contrast and upload a
new version, after which all pixel coordinates might be ruined.
The page scans will be uploaded to Commons as DJVU files, which are
huge, and we don't really want regular updates to them.
I think the way this would be handled is as separate images.
e.g. If I cleaned up page 1 of [[Image:35 Sonnets by Fernando
Pessoa.djvu]], I save it as [[Image:35 Sonnets by Fernando
Pessoa.djvu-page1.jpg]], and I update [[Index:35 Sonnets]] to use the
standalone image instead of the DJVU. Once completed, any standalone
images would then be used to rebuild the DJVU file.
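Rebuilding the file from a cleaned-up page could look roughly like this
with the DjVuLibre tools (filenames illustrative; c44 suits greyscale
or colour scans, cjb2 bitonal ones):

```shell
# Encode the cleaned page image as a single-page DjVu file
c44 "35 Sonnets by Fernando Pessoa.djvu-page1.jpg" page1.djvu

# Swap it into the book: delete the old page 1, insert the new one
djvm -d "35 Sonnets by Fernando Pessoa.djvu" 1
djvm -i "35 Sonnets by Fernando Pessoa.djvu" page1.djvu 1
```

Note that this replaces the page image wholesale, which is exactly why
any pixel coordinates stored for that page would need regenerating.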
3. As you mentioned, there are now some open sourced OCR engines.
I haven't tried them, but I assume they will improve and become
useful. The traditional use for OCR is to read an image and
output raw text, but proofreading has traditionally been a
one-person process with very limited feedback. When collaborative
proofreading (as in PGDP.net or Wikisource) is combined with open
sourced OCR software, we have a new potential feedback loop.
Instead of finding the words in an image, we would need a routine
that takes a scanned image and an already proofread text, and
tries to find the pixel coordinates for these words. If that sort
of software existed, we wouldn't need to preserve coordinates
during proofreading, because we could reconstruct them afterwards.
This might be a suitable summer-of-code project for the right
person, who is already familiar with the OCR software.
Finding the location of a given text on an image is a novel idea. It
is an interesting project that might even be suitable as a research
project for a post-grad.
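The core of the routine Lars describes is a sequence alignment between
the (coordinate-bearing) OCR words and the proofread words. A minimal
sketch in Python, using difflib and made-up word boxes, shows the idea
for the easy one-to-one case:

```python
import difflib

# Hypothetical inputs: OCR output as (word, bounding_box) pairs with
# typical OCR errors, plus the proofread text. All data is illustrative.
ocr_words = [("Frorn", (10, 10, 60, 30)), ("fairest", (65, 10, 130, 30)),
             ("creatures", (135, 10, 220, 30)), ("we", (225, 10, 250, 30)),
             ("desirc", (255, 10, 310, 30)), ("increase", (315, 10, 390, 30))]
proofread = "From fairest creatures we desire increase".split()

def align_coordinates(ocr_words, proofread):
    """Map proofread words onto OCR bounding boxes by sequence alignment."""
    matcher = difflib.SequenceMatcher(a=[w for w, _ in ocr_words], b=proofread)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
            # One-to-one correspondence: reuse the OCR box for the
            # corrected word
            for k in range(i2 - i1):
                result.append((proofread[j1 + k], ocr_words[i1 + k][1]))
        # Insertions and deletions would need box interpolation;
        # omitted in this sketch
    return result

aligned = align_coordinates(ocr_words, proofread)
for word, box in aligned:
    print(word, box)
```

The hard research problem is everything this sketch skips: words the
OCR missed or invented, hyphenation, and re-estimating boxes after the
image itself has been deskewed or cropped.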
If I understand correctly, you are suggesting that Greg uploads the
DJVU without a text layer, and we all use whatever means we have
available to create the text, and then we feed the proofread
text into the DJVU once complete (using vaporware software? :-) )
This has the distinct advantage of allowing the images to be improved
as well as the text. We may even be able to push an improved DJVU
file onto the Commons front page as a featured image.
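The "feed the text back in" step is not entirely vaporware, at least
for text without coordinates: djvused can replace a page's text layer
directly. A sketch (page1.txt would hold the proofread text in djvused
s-expression form; coordinates are the part we would still have to
invent):

```shell
# page1.txt contains something like:
#   (page 0 0 2550 3300
#     (line 100 3100 2400 3160 "FROM fairest creatures we desire increase,"))
djvused -s "35 Sonnets by Fernando Pessoa.djvu" \
        -e 'select 1; set-txt page1.txt'
```

The -s flag saves the modified file in place, so this would be run by a
bot once a page is marked proofread, not by hand.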
As crazy as it sounds, it is quite sane. OCR software will improve
over the next year, and we want to be taking advantage of those
improvements as we progress through the volumes. ThomasV has already
set up the framework for user-requested bot scans; we may need to
extend this to handle different configurations to suit each DJVU file,
so that the OCR software can "learn" as it progresses.
Also, distributing the low-quality OCR text in the DJVU file initially
results in many potential contributors not joining the ranks, because
the OCR text seems "good enough". Leaving the OCR text out will
encourage people to work with Wikisource to finish each volume.
--
John