Re: [Wikipedia-l] [Wikisource-l] [Commons-l] Dream a little...

16 Oct 2006

On 16/10/06, Yann Forget <yann(a)forget-me.net> wrote:

...
   2. While OCR capacities exist for some languages, they do not exist for
 other languages, where the material is much more likely to get lost.
 Manuscripts in Tibetan monasteries, for example, can be scanned but not
 OCRed easily. To make this information available, developers should be
 paid to create adequate OCR tools for these languages. Rough cost: $5
 million.

 Wikisource is currently limited mostly by the capability to scan and
 OCR documents. There is no good free OCR software, apart from the
 software recently released under the GPL by Google, but it works only
 for English and still has limitations. So developing good, free,
 multilingual OCR software would be my priority. AFAIK there is no good
 OCR software (free or not) for any Indian language, including Sanskrit.
 I have never seen any for Tibetan either.

 But having the software is not enough. A few OCR servers managed by the
 Foundation, where anyone can send an automated OCR request, would be very
 useful. There is already proprietary OCR software that can do that.

This is a very, very, very good idea. Having a dedicated system to
input TIFF images (or the like) and spit out high-grade OCR, rather
than just relying on whatever the scanning volunteer can come up with,
would help the wikisource-like projects leap ahead.

...has anyone proposed this to Project Gutenberg? If they can get the
money together, it might free up an *awful* lot of their volunteer
time.

-- 
- Andrew Gray
  andrew.gray(a)dunelm.org.uk
