Lars Aronsson wrote:
Hi Thomas,
The new Ajax-OCR service does not use a robot to create pages; it sends the OCR text directly to the edit box, so that it can be proofread by the user. It has its own job queue, which can be seen here: http://toolserver.org/~thomasv/ocr.php
Oh, great, I didn't know this existed. For which languages does it work? Does it scale to other languages? Does your OCR engine get any feedback from proofreading? Is there any documentation of how this works?
It is installed at 11 subdomains; see OCR.js at http://wikisource.org/wiki/Wikisource:Shared_Scripts
If you want to upload or improve OCR, you should update the OCR layer of the DjVu file. Thus, you do not need to create dozens of pages with a robot.
Yes, in my dreams. And ultimately the new DjVu would be fed back to the place (Internet Archive, Google, ...) where it came from. But how would this work?
If (in my science-fiction dreams) Commons had an API that would accept new OCR for a page of a DjVu file, your Ajax routine, as well as the standard proofreading form, could do this right away. One major problem is that our proofreading (and some OCR software) loses the image coordinates of the words in the text.
This is not a dream; it is a very common operation. I believe there is a help page at en.ws that describes how to update the text layer of a DjVu file. Once you've done this, you just need to upload the modified DjVu as a new version of the file. The fact that image coordinates are lost in the process is not a problem for Wikisource.
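For readers who have not done this before, the update can be performed with djvused from the DjVuLibre toolkit. A minimal sketch; the file name, page number, and page dimensions below are placeholders, not values from this thread:

```shell
# Sketch: replace the hidden text layer of one page of a DjVu file
# using djvused (DjVuLibre). "book.djvu", page 5, and the 2550x3300
# page box are placeholder values.

# 1. Write the corrected text as a djvused hidden-text s-expression.
#    Word- or line-level boxes are possible, but a single page-level
#    string works when image coordinates were lost during proofreading.
cat > page5.txt <<'EOF'
(page 0 0 2550 3300
  "Rom, forstår ni? — he he! — Nå ja, der nere i det")
EOF

# 2. Replace page 5's text layer and save the file in place
#    (guarded so the sketch is harmless when djvused or the
#    file is absent).
if command -v djvused >/dev/null && [ -f book.djvu ]; then
  djvused book.djvu -e 'select 5; remove-txt; set-txt page5.txt; save'
fi

# 3. The modified book.djvu can then be uploaded as a new
#    version of the file on Commons.
```

The same `set-txt` step can be scripted across all pages of a book if the proofread text is exported page by page.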
ThomasV wrote:
The new Ajax-OCR service does not use a robot to create pages;
It is installed at 11 subdomains; see OCR.js at http://wikisource.org/wiki/Wikisource:Shared_Scripts
So, it is activated for the Norwegian Wikisource. But when I tried it, only garbage came out:
Expected: Rom, forstår ni? — he he! — Nå ja, der nere i det
Got: Rom, forstiir ni? — he he! ·— Néja, der nere idet,
Maybe the OCR is set for French, not for Norwegian, since the output is full of accented é but no Norwegian æøå.
The idea is very nice, but how do we make this work for Scandinavian and other languages? What OCR engine do you use?
I believe there is a help page at en.ws that describes how to update the text layer of a DjVu file. Once you've done this, you just need to upload the modified DjVu as a new version of the file. The fact that image coordinates are lost in the process is not a problem for Wikisource.
For Wikisource, losing the coordinates is not a problem. And for Wikisource, updating the DjVu is not an issue either: Wikisource is fine with having the text in the Page: namespace. It is external users who might want an updated DjVu file, and they might care about searching for a word and finding it at the right position in the image.
Lars Aronsson wrote:
So, it is activated for the Norwegian Wikisource. But when I tried it, only garbage came out:
Expected: Rom, forstår ni? — he he! — Nå ja, der nere i det
Got: Rom, forstiir ni? — he he! ·— Néja, der nere idet,
Maybe the OCR is set for French, not for Norwegian, since the output is full of accented é but no Norwegian æøå.
The idea is very nice, but how do we make this work for Scandinavian and other languages? What OCR engine do you use?
Someone already asked the same question; see: http://fr.wikisource.org/wiki/Discussion_utilisateur:ThomasV#Norwegian_OCR.3...
wikisource-l@lists.wikimedia.org