On 16/10/06, Yann Forget <yann(a)forget-me.net> wrote:
2. While OCR
capacities exist for some languages, they do not exist for
other languages, where the material is much more likely to get lost.
Manuscripts in Tibetan monasteries, for example, can be scanned but not
OCRed easily. To make this information available, developers should be
paid to create adequate OCR tools for these languages. Rough cost: $5
million.
Much of what limits Wikisource now is the capability to scan and OCR
documents. There is no good free OCR software, apart from the new
software recently released under the GPL by Google, but it works only
for English and still has limitations. So developing good, free,
multilingual OCR software would be my priority. AFAIK there is no good
OCR software (free or not) for any Indian language, including Sanskrit.
I have never seen any for Tibetan either.
But having the software is not enough. A few OCR servers managed by the
Foundation, where anyone can send an automated OCR request, would be
very useful. There is already proprietary OCR software that can do that.
This is a very, very, very good idea. Having a dedicated system to
input TIFF images (or the like) and spit out high-grade OCR, rather
than just relying on whatever the scanning volunteer can come up with,
would help the wikisource-like projects leap ahead.
...has anyone proposed this to Project Gutenberg? If they can get the
money together, it might free up an *awful* lot of their volunteer
time.
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk