On 11/24/2014 11:13 PM, Federico Leva (Nemo) wrote:
> When I think of this, I agree that OCR is the main issue. But it's not
> necessarily the one which worries me most, because tesseract is
> something living outside the wiki which can be improved even if the
> wiki has design issues. If we try really hard, we may face unsolvable
> integration problems in the OCR<->DjVu<->Wikisource food chain; but so
> far the issue is rather that we never tried seriously.[1]
>
> [1] https://www.mediawiki.org/wiki/CAPTCHA
The problem is that we are stuck in the notion that "it must
be a wiki". The wiki is just one tool; captchas could be
another. The goal is to make the contents of books available
in a more correct, more reliable and more useful form. To
scale up, we should have the ambition to handle every book
in the Internet Archive. (Books from other sources, such as
Google, can be copied to the Internet Archive.)
Our use of OCR today is indeed "outside the wiki": to us it
is a one-time operation. But it shouldn't be. When a book
page is proofread, the OCR software should learn from it.
Aha, it wasn't "arn", it was "am". And when the OCR software
has improved, all other pages should be evaluated again.
Maybe the same arn/am error occurs in more places? It sounds
like an impossible job to reprocess millions of pages every
day, but that is where an algorithm designer starts. Maybe
we can index the patterns, so that all possible arn/am
occurrences can be found in a second and quickly reprocessed.

As you proofread one page, a hundred other pages in dozens
of books are also improved. With this kind of application in
mind, a wiki to proofread one page and a captcha to proofread
one word are just two kinds of tools for collecting human
contributions to the improvement of the OCR engine and of
the library.
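The indexing idea above can be sketched in a few lines. This is a
hypothetical toy, not any existing Wikisource or Internet Archive code:
it indexes OCR output by character trigrams, so that when a proofreader
corrects a recognition error (say "arn" read where "am" was printed),
every page containing the suspect pattern is found at once and could be
queued for reprocessing, instead of rescanning the whole library. The
class and page names are invented for illustration.

```python
from collections import defaultdict

class OcrPatternIndex:
    """Toy index of OCR text by character trigrams (hypothetical,
    for illustration only). A correction like "arn" -> "am" can
    then locate all candidate pages in one lookup."""

    def __init__(self):
        self.index = defaultdict(set)  # trigram -> set of page ids
        self.pages = {}                # page id -> raw OCR text

    def add_page(self, page_id, ocr_text):
        """Store the OCR text and index every overlapping trigram."""
        self.pages[page_id] = ocr_text
        for i in range(len(ocr_text) - 2):
            self.index[ocr_text[i:i + 3]].add(page_id)

    def pages_with_pattern(self, pattern):
        """Intersect the trigram posting lists, then verify with a
        substring check. The work is proportional to the smallest
        posting list, not to the size of the whole library."""
        grams = [pattern[i:i + 3] for i in range(len(pattern) - 2)]
        candidates = set.intersection(
            *(self.index.get(g, set()) for g in grams))
        return {p for p in candidates if pattern in self.pages[p]}

# When the correction "arn" -> "am" is recorded, only matching
# pages are re-queued for the improved OCR engine.
idx = OcrPatternIndex()
idx.add_page("book1/p17", "I arn very pleased to see you")
idx.add_page("book1/p18", "a barn near the river")
idx.add_page("book2/p03", "nothing suspicious here")
print(sorted(idx.pages_with_pattern("arn")))
# → ['book1/p17', 'book1/p18']
```

Note that "barn" on book1/p18 is a hit too: a pattern match only yields
candidates, and the re-run OCR engine (or a human) still decides from
context which occurrences were actually errors.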
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se