[Wikimedia-l] The case for supporting open source machine translation

Mon Apr 29 06:29:33 UTC 2013

On 26/04/13 19:38, Bjoern Hoehrmann wrote:
> * Andrea Zanni wrote:
>> At the moment, Wikisource could be an interesting corpus and laboratory for
>> improving and enhancing OCR,
>> as the OCR-generated text is always proofread and corrected by humans.

Try also Distributed Proofreaders. It is my impression that Wikisource's 
proofreading standards are not always up to par.

>> As part of our project (
>> http://wikisource.org/wiki/Wikisource_vision_development), Micru was
>> looking for a GSoC candidate for studying the reinsertion of proofread text
>> into DjVu files [1], but so far has not found any interested student. We
>> have some contacts with people at Google working on Tesseract, and they
>> were available for mentoring.
>
>> [1] We thought about this both for OCR enhancement purposes and for updating
>> files on Commons and Internet Archive (which is off topic here).
>
> I built various tools that could be fairly easily adapted for this; my
> http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
> notes are available. One of the tools for instance is a diff tool, see
> image at <http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031>.

This is a very interesting approach :)