On 11/24/2014 11:13 PM, Federico Leva (Nemo) wrote:
When I think of this, I agree that OCR is the main issue. But it's not necessarily the one that worries me most, because Tesseract lives outside the wiki and can be improved even if the wiki has design issues. If we try really hard, we may face unsolvable integration problems in the OCR<->DjVU<->Wikisource food chain; but so far the issue is rather that we never tried seriously.[1]
The problem is that we are stuck in the notion that "it must be a wiki". The wiki is just one tool; captchas could be another. The goal is to make the contents of books available in a more correct, more reliable and more useful form. To scale things up, we should have the ambition to handle all books in the Internet Archive. (Books from other sources, such as Google, can be copied to the Internet Archive.)
Our use of OCR today is indeed "outside the wiki"; to us it is a one-time operation. But it shouldn't be. When a book page is proofread, the OCR software should learn from this: aha, it wasn't "arn", it was "am". And when the OCR software has improved, all other pages should be evaluated again. Maybe the arn/am error was found in more places? It sounds like an impossible job to reprocess millions of pages every day, but that's where an algorithm designer starts. Maybe we can index the patterns, so all possible arn/am occurrences can be found in a second and quickly reprocessed. As you proofread one page, a hundred other pages in dozens of books are also improved. With this kind of application in mind, a wiki to proofread one page or a captcha to proofread one word are just two kinds of tools to collect the human contribution to the improvement of the OCR engine and to the library.
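To make the indexing idea concrete, here is a minimal sketch of what such a pattern index could look like. This is only an illustration of the approach, not anything Wikisource or Tesseract actually implements; the class and page identifiers are invented for the example. The idea is simply to index character n-grams of the raw OCR text, so that when a proofreader corrects a misread pattern on one page, every other page containing the same suspect pattern can be located instantly and queued for re-evaluation.

```python
from collections import defaultdict

class CorrectionIndex:
    """Hypothetical sketch: map character patterns to the pages that
    contain them, so a proofreading correction (e.g. 'arn' -> 'am')
    can instantly locate every other page worth rechecking."""

    def __init__(self, ngram_size=3):
        self.ngram_size = ngram_size
        # pattern -> set of page identifiers containing it
        self.pages_by_pattern = defaultdict(set)

    def add_page(self, page_id, ocr_text):
        # Index every character n-gram of the raw OCR output.
        for i in range(len(ocr_text) - self.ngram_size + 1):
            ngram = ocr_text[i:i + self.ngram_size]
            self.pages_by_pattern[ngram].add(page_id)

    def pages_to_recheck(self, misread):
        # A correction 'arn' -> 'am' marks 'arn' as a suspect pattern:
        # return every indexed page that contains it.
        return sorted(self.pages_by_pattern.get(misread, set()))

# Toy corpus: invented page identifiers and OCR snippets.
index = CorrectionIndex()
index.add_page("book1/p1", "I arn happy")
index.add_page("book1/p2", "a barn door")
index.add_page("book2/p7", "nothing here")

# A proofreader fixes 'arn' -> 'am' on book1/p1; which other
# pages contain the same suspect pattern?
print(index.pages_to_recheck("arn"))  # → ['book1/p1', 'book1/p2']
```

A real system would of course need word context (so a legitimate "barn" is not "fixed" into "bam") and would feed the confirmed corrections back into retraining the OCR engine, but the lookup itself stays cheap: one dictionary access per corrected pattern, regardless of how many millions of pages are indexed.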