On 16/10/06, Yann Forget yann@forget-me.net wrote:
- While OCR capacities exist for some languages, they do not exist for
other languages, where the material is much more likely to get lost. Manuscripts in Tibetan monasteries, for example, can be scanned but not easily OCRed. To make this information available, developers should be paid to create adequate OCR tools for these languages. Rough cost: $5 million.
Many of Wikisource's current limits come down to the ability to scan and OCR documents. There is no good free OCR software, apart from the software recently released under the GPL by Google, but it works only for English and still has limitations. So developing good free, multilingual OCR software would be my priority. AFAIK there is no good OCR software (free or not) for any Indian language, including Sanskrit. I have never seen any for Tibetan either.
But having the software is not enough. A few OCR servers managed by the Foundation, where anyone can send an automated OCR request, would be very useful. There is already proprietary OCR software that can do that.
This is a very, very, very good idea. Having a dedicated system that takes TIFF images (or the like) as input and spits out high-grade OCR, rather than just relying on whatever the scanning volunteer can come up with, would help the Wikisource-like projects leap ahead.
...has anyone proposed this to Project Gutenberg? If they can get the money together, it might free up an *awful* lot of their volunteer time.