On 11/24/2014 11:13 PM, Federico Leva (Nemo) wrote:
When I think of this, I agree that OCR is the main issue. But it's not necessarily the one that worries me most, because Tesseract lives outside the wiki and can be improved even if the wiki has design issues. If we try really hard, we may face unsolvable integration problems in the OCR<->DjVU<->Wikisource food chain; but so far the issue is rather that we never tried seriously.[1]
The problem is that we are stuck in the notion that "it must be a wiki". The wiki is just one tool; captchas could be another. The goal is to make the contents of books available in a more correct, more reliable and more useful form. To scale things up, we should have the ambition to handle all books in the Internet Archive. (Books from other sources, such as Google, can be copied to the Internet Archive.)
Our use of OCR today is indeed "outside the wiki"; to us it is a one-time operation. But it shouldn't be. When a book page is proofread, the OCR software should learn from this: aha, it wasn't "arn", it was "am". And when the OCR software has improved, all other pages should be evaluated again. Maybe the arn/am error was found in more places? It sounds like an impossible job to reprocess millions of pages every day, but that's where an algorithm designer starts. Maybe we can index the patterns, so all possible arn/am occurrences can be found in a second and quickly reprocessed. As you proofread one page, a hundred other pages in dozens of books are also improved. With this kind of application in mind, a wiki to proofread one page or a captcha to proofread one word are just two kinds of tools to collect the human contribution to the improvement of the OCR engine and to the library.
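To make the indexing idea concrete, here is a minimal sketch of what such a pattern index could look like. This is only an illustration of the approach, not anything Wikisource or Tesseract actually implements; the class and page identifiers are invented for the example. The idea is simply to index character n-grams of the raw OCR text, so that when a proofreader corrects a misread pattern on one page, every other page containing the same suspect pattern can be located instantly and queued for re-evaluation.

```python
from collections import defaultdict

class CorrectionIndex:
    """Hypothetical sketch: map character patterns to the pages that
    contain them, so a proofreading correction (e.g. 'arn' -> 'am')
    can instantly locate every other page worth rechecking."""

    def __init__(self, ngram_size=3):
        self.ngram_size = ngram_size
        # pattern -> set of page identifiers containing it
        self.pages_by_pattern = defaultdict(set)

    def add_page(self, page_id, ocr_text):
        # Index every character n-gram of the raw OCR output.
        for i in range(len(ocr_text) - self.ngram_size + 1):
            ngram = ocr_text[i:i + self.ngram_size]
            self.pages_by_pattern[ngram].add(page_id)

    def pages_to_recheck(self, misread):
        # A correction 'arn' -> 'am' marks 'arn' as a suspect pattern:
        # return every indexed page that contains it.
        return sorted(self.pages_by_pattern.get(misread, set()))

# Toy corpus: invented page identifiers and OCR snippets.
index = CorrectionIndex()
index.add_page("book1/p1", "I arn happy")
index.add_page("book1/p2", "a barn door")
index.add_page("book2/p7", "nothing here")

# A proofreader fixes 'arn' -> 'am' on book1/p1; which other
# pages contain the same suspect pattern?
print(index.pages_to_recheck("arn"))  # → ['book1/p1', 'book1/p2']
```

A real system would of course need word context (so a legitimate "barn" is not "fixed" into "bam") and would feed the confirmed corrections back into retraining the OCR engine, but the lookup itself stays cheap: one dictionary access per corrected pattern, regardless of how many millions of pages are indexed.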