New subject: OCR requests

22 Apr 2010

Lars Aronsson wrote:
...
  Hi Thomas,

  The new Ajax-OCR service does not use a robot to create pages; it sends
 the OCR text directly to the edit box, so that it can be proofread by
 the user. It has its own job queue, which can be seen here:
 http://toolserver.org/~thomasv/ocr.php
 Oh, great, I didn't know this existed. For which languages
 does it work? Does it scale to other languages? Does your
 OCR engine get any feedback from proofreading? Is there
 any documentation of how this works?

It is installed on 11 subdomains.
See OCR.js at http://wikisource.org/wiki/Wikisource:Shared_Scripts

...
 If you want to upload or improve OCR, you should update the OCR layer
 of the DjVu file. Thus, you do not need to create dozens of pages with
 a robot.
 Yes, in my dreams. And ultimately the new DjVu would
 be fed back to the place (Internet Archive, Google, ...)
 where it came from. But how would this work?

 If (in my science fiction dreams) Commons had an API
 that would accept new OCR for a page of a DjVu file,
 your Ajax routine as well as the standard proofreading
 form could do this right away. One major problem is that
 our proofreading (and some OCR software) loses the
 image coordinates of the words in the text.

This is not a dream; it is a very common operation.
I believe there is a help page at en.ws that describes how
to update the text layer of a DjVu file. Once you've done this,
you just need to upload the modified DjVu as a new version
of the file. The fact that image coordinates are lost in the
process is not a problem for Wikisource.
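
For illustration only, here is a rough Python sketch of one way to do the text layer update with the djvused tool from DjVuLibre. It is not the procedure from the en.ws help page, which I have not reproduced here. The helper names are made up for this example. It assumes the proofread text is stored as a single page-level zone covering the whole page (so word coordinates are lost, as discussed above), that the page width and height in pixels are already known, and that the file paths contain no spaces.

```python
import subprocess


def escape_djvused_string(text):
    """Escape text for a double-quoted string in djvused's text syntax."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ord(ch) < 32 or ord(ch) == 127:
            # Control characters (newlines included) become octal escapes.
            out.append("\\%03o" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


def page_text_sexpr(text, width, height):
    """Build one page-level zone holding all the text of the page."""
    return '(page 0 0 %d %d "%s")' % (width, height, escape_djvused_string(text))


def djvused_set_txt_command(djvu_path, page_number, sexpr_path):
    """Return the argv that replaces the hidden text of one page and saves."""
    script = "select %d; set-txt %s" % (page_number, sexpr_path)
    return ["djvused", djvu_path, "-e", script, "-s"]


def update_page_text(djvu_path, page_number, text, width, height, sexpr_path):
    """Write the zone to a file, then run djvused (needs DjVuLibre installed)."""
    with open(sexpr_path, "w", encoding="utf-8") as f:
        f.write(page_text_sexpr(text, width, height) + "\n")
    subprocess.run(djvused_set_txt_command(djvu_path, page_number, sexpr_path),
                   check=True)
```

For example, update_page_text("book.djvu", 3, corrected_text, 2550, 3300, "page3.txt") would rewrite the text of page 3 in place; the modified file can then be uploaded as a new version.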

Re: [Wikisource-l] OCR requests