Lars Aronsson wrote:
Hi Thomas,
The new Ajax-OCR service does not use a robot to create pages; it puts the OCR text directly into the edit box, so that the user can proofread it. It has its own job queue, which can be seen here: http://toolserver.org/~thomasv/ocr.php
Oh, great, I didn't know this existed. For which languages does it work? Does it scale to other languages? Does your OCR engine get any feedback from proofreading? Is there any documentation of how this works?
It is installed on 11 subdomains. See OCR.js at http://wikisource.org/wiki/Wikisource:Shared_Scripts
If you want to upload or improve OCR, you should update the OCR layer of the DjVu file. Thus, you do not need to create dozens of pages with a robot.
Yes, in my dreams. And ultimately the new DjVu would be fed back to the place (Internet Archive, Google, ...) where it came from. But how would this work?
If (in my science fiction dreams) Commons had an API that would accept new OCR for a page of a DjVu file, your Ajax routine as well as the standard proofreading form could do this right away. One major problem is that our proofreading (and some OCR software) loses the image coordinates of the words in the text.
This is not a dream; it is a very common operation. I believe there is a help page at en.ws that describes how to update the text layer of a DjVu file. Once you've done this, you just need to upload the modified DjVu as a new version of the file. The fact that image coordinates are lost in the process is not a problem for Wikisource.
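
For illustration, here is a rough sketch of what updating the text layer of one page could look like. It is not taken from the en.ws help page. It assumes the `djvused` tool from DjVuLibre is installed, and the helper names (`escape_djvused`, `page_text_sexpr`, `djvused_command`) and the file names are made up for the example. Because word coordinates are lost, the whole proofread text is attached to a single page-sized zone. The page width and height are assumed to be known already (djvused's `size` command should report them), and the paths are assumed to contain no spaces.

```python
import subprocess


def escape_djvused(text: str) -> str:
    """Escape text for a double-quoted string in a djvused text file.

    Assumption: C-style backslash escapes are enough for backslash,
    double quote and newline.
    """
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def page_text_sexpr(text: str, width: int, height: int) -> str:
    """Build a hidden-text expression with one zone covering the page.

    No per-word coordinates are kept: the whole proofread text is
    attached to the rectangle (0, 0) - (width, height).
    """
    return f'(page 0 0 {width} {height} "{escape_djvused(text)}")'


def djvused_command(djvu_path: str, page_number: int, txt_path: str) -> list:
    """Return the djvused call that replaces the text of one page.

    "-s" asks djvused to save the modified file in place.
    Assumption: neither path contains spaces or quotes.
    """
    script = f"select {page_number}; set-txt {txt_path}"
    return ["djvused", djvu_path, "-e", script, "-s"]


def update_page_text(djvu_path, page_number, text, width, height, txt_path):
    """Write the expression to txt_path, then run djvused on the file."""
    with open(txt_path, "w", encoding="utf-8") as handle:
        handle.write(page_text_sexpr(text, width, height) + "\n")
    subprocess.run(djvused_command(djvu_path, page_number, txt_path), check=True)
```

A call such as `update_page_text("book.djvu", 12, proofread_text, 2550, 3300, "page12.txt")` would then rewrite the text of page 12 in place, after which the modified file can be uploaded as a new version. It is worth running this on a copy of the file first.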