The Swedish Wikisource is copying scanned books from various sources. You typically find a PDF or DjVu file containing both the scanned images and raw OCR text, upload it to Commons, and then create an Index: page with the <pagelist/> tag.
Some of these books have pretty miserable OCR text, perhaps because the Norwegian National Library scanned a Swedish book with its OCR software set to Norwegian. Somebody with an OCR program needs to run a new OCR pass on these images. Fortunately, it is quite easy to feed the PDF or DjVu file into an OCR program such as FineReader and then use a bot to update the pages. We now have one user on sv.wikisource doing this.
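To make the re-OCR step concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: it uses the DjVuLibre tool ddjvu to render one page to TIFF and the open-source Tesseract engine (not FineReader) with its Swedish language data ("swe") to recognize it. The function names and the file name page.tif are hypothetical, and the bot step that writes the text to the wiki is left out.

```python
import subprocess


def render_page_command(djvu_path, page, tiff_path):
    """Build the ddjvu command that renders one page of a DjVu file to TIFF."""
    return ["ddjvu", "-format=tiff", f"-page={page}", djvu_path, tiff_path]


def ocr_command(tiff_path, lang="swe"):
    """Build the tesseract command; 'stdout' as output base prints the text."""
    return ["tesseract", tiff_path, "stdout", "-l", lang]


def reocr_page(djvu_path, page, tiff_path="page.tif", lang="swe"):
    """Render one page and return freshly recognized text.

    Assumes ddjvu and tesseract (with the chosen language data) are installed.
    """
    subprocess.run(render_page_command(djvu_path, page, tiff_path), check=True)
    result = subprocess.run(
        ocr_command(tiff_path, lang), check=True, capture_output=True
    )
    return result.stdout.decode("utf-8")
```

Setting the recognition language explicitly is the point here: a Swedish book read with Norwegian settings is exactly the failure described above.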
For these Index: pages, I created a category, Kategori:OCR-kö (meaning: queue of OCR requests). When I tried to add interwiki links, I found a similar category on de.wikisource, but the corresponding categories on fr, en, and pt had been removed.
What's the story behind that? Don't you need OCR requests in these languages? The comment on the English page mentions an OCR robot on the toolserver. Really?
Still exist:
http://de.wikisource.org/wiki/Kategorie:OCR-Anfragen
http://sv.wikisource.org/wiki/Kategori:OCR-k%C3%B6
Were removed in June 2009:
http://en.wikisource.org/wiki/Category:OCR_Requests
http://fr.wikisource.org/wiki/Cat%C3%A9gorie:Demandes_d%27OCR
http://pt.wikisource.org/wiki/Categoria:!Pedidos_de_OCR
The categories you mention were used by my robot to perform on-demand OCR. They are deprecated: the new version of the OCR service no longer needs them. Instead, it uses an OCR button in the toolbar and works with Ajax. This is why the categories were deleted.
The new Ajax OCR service does not use a robot to create pages; it sends the OCR text directly into the edit box, so that the user can proofread it. It has its own job queue, which can be seen here: http://toolserver.org/~thomasv/ocr.php
Please note that the Ajax OCR service is different from DjVu text layer extraction. Whenever possible, you should use the text layer of the DjVu file, not the OCR button. If you want to upload or improve OCR, you should update the text layer of the DjVu file itself. That way you do not need to create dozens of pages with a robot. It is better to avoid creating raw OCR pages with a robot, because you don't know if or when they will be proofread.
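As a rough illustration of working with the text layer directly, the sketch below wraps two DjVuLibre command-line tools in Python: djvutxt to read the hidden text of one page, and djvused with its set-txt command to write an improved layer back into the file. The function names and file names are hypothetical. Note that set-txt expects the text in djvused's own S-expression format (the format that its print-txt command outputs), not plain text, and the sketch assumes file paths without spaces.

```python
import subprocess


def djvutxt_command(djvu_path, page):
    """Build the djvutxt command that prints the text layer of one page."""
    return ["djvutxt", f"-page={page}", djvu_path]


def set_txt_command(djvu_path, page, sexpr_path):
    """Build the djvused command that replaces one page's text layer.

    sexpr_path must hold hidden text in djvused S-expression format.
    The -s flag saves the modified DjVu file in place.
    """
    script = f"select {page}; set-txt {sexpr_path}"
    return ["djvused", djvu_path, "-e", script, "-s"]


def read_text_layer(djvu_path, page):
    """Return the existing text layer of one page (requires DjVuLibre)."""
    result = subprocess.run(
        djvutxt_command(djvu_path, page), check=True, capture_output=True
    )
    return result.stdout.decode("utf-8")


def write_text_layer(djvu_path, page, sexpr_path):
    """Overwrite the text layer of one page in the DjVu file."""
    subprocess.run(set_txt_command(djvu_path, page, sexpr_path), check=True)
```

For example, read_text_layer("book.djvu", 3) would return whatever text the file already carries for page 3, which is a quick way to judge whether a new OCR pass is needed before updating the file.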
Thomas
(In reply to the message from Lars Aronsson above.)
wikisource-l@lists.wikimedia.org