On 26/04/13 19:38, Bjoern Hoehrmann wrote:
* Andrea Zanni wrote:
> At the moment, Wikisource could be an interesting corpus and laboratory for
> improving and enhancing OCR,
> as the OCR generated text is always proofread and corrected by humans.
Try also Distributed Proofreaders. It is my impression that Wikisource's
proofreading standards are not always up to par.
As part of our project
(http://wikisource.org/wiki/Wikisource_vision_development), Micru was
looking for a GSoC candidate to study the reinsertion of proofread text
into DjVu files [1], but so far has not found an interested student. We
have some contacts with people at Google working on Tesseract, and they
were available for mentoring.
[1] We thought about this both for OCR enhancement purposes and for
updating files on Commons and the Internet Archive (which is off topic
here).
I built various tools that could be fairly easily adapted for this; my
notes are available at
http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
One of the tools, for instance, is a diff tool; see the image at
<http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031>.
This is a very interesting approach :)
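
To make the diff idea concrete, here is a minimal sketch of a word-level
comparison between raw OCR output and its proofread version. It is not the
tool linked above; it only assumes Python's standard difflib module, and the
function name word_diff and the sample strings are made up for illustration.

```python
import difflib


def word_diff(ocr_text, proofread_text):
    """List the word spans where OCR output and proofread text differ.

    Hypothetical helper: returns (operation, ocr_words, proofread_words)
    tuples, where operation is 'replace', 'delete' or 'insert'.
    """
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    # autojunk=False turns off the heuristic that ignores very frequent
    # tokens in long sequences, so common words are still aligned.
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op, ocr_words[i1:i2], good_words[j1:j2]))
    return changes


if __name__ == "__main__":
    # Invented sample: two typical OCR character confusions.
    for change in word_diff("Tbe quick hrown fox", "The quick brown fox"):
        print(change)
```

Each reported pair is a candidate correction that could, in principle, be
mapped back to the word positions stored in the DjVu text layer or collected
as training material for an OCR engine.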