The categories you mention were used by my robot to perform
on-demand OCR. These categories are deprecated: the new version of the
OCR service does not need them anymore. Instead, it uses an OCR button
in the toolbar and works through Ajax. This is why the categories were
deleted.
The new Ajax OCR service does not use a robot to create pages; it puts
the OCR text directly into the edit box, so that it can be proofread by
the user. It has its own job queue, which can be seen here:
Please note that the Ajax OCR service is different from DjVu
text layer extraction. Whenever possible, you should use the DjVu text
layer of the file, not the OCR button. If you want to upload or improve
OCR, you should update the OCR layer of the DjVu file. That way, you do
not need to create dozens of pages with a robot. It is better to avoid
creating raw OCR pages with a robot, because you don't know if or when
they will be proofread.
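
If you want to check what the text layer of a file currently contains
before deciding whether it needs a new OCR, here is a minimal sketch.
It assumes the djvutxt tool from DjVuLibre is installed and on the
PATH; the function names and the file name "book.djvu" are only
illustrative, and this is not what the OCR service itself runs.

```python
import subprocess


def build_djvutxt_command(djvu_path, page):
    """Build the djvutxt argument list for one page (pages are 1-based)."""
    return ["djvutxt", f"--page={page}", djvu_path]


def extract_page_text(djvu_path, page):
    """Return the hidden text layer of one page as a string.

    Assumes the djvutxt executable (DjVuLibre) is available on PATH.
    Raises subprocess.CalledProcessError if djvutxt exits with an error.
    """
    result = subprocess.run(
        build_djvutxt_command(djvu_path, page),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# Example (placeholder file name):
#     text = extract_page_text("book.djvu", 1)
```

An empty or garbled result for a page suggests that the text layer of
the file is what should be improved, rather than the wiki pages.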
Lars Aronsson wrote:
The Swedish Wikisource is copying scanned books from
various sources. You typically find a PDF or DJVU file,
containing both scanned images and raw OCR text,
that you upload to Commons, and then create an Index: page
with the <pagelist/> tag.
Some of these books have pretty miserable OCR text,
perhaps because the Norwegian National library scanned
a Swedish book with their OCR software set to Norwegian.
Somebody with an OCR program needs to run a new OCR
on these images. Fortunately, it is quite easy to
feed the PDF or DJVU file into an OCR program such
as Finereader, and use a bot to update the pages.
We now have one user on sv.wikisource doing this.
For these Index: pages, I created a category:OCR-kö
(meaning: queue of OCR requests). When trying to interwiki
link, I found a similar category on de.wikisource, but
similar categories on fr, en, and pt had been removed.
What's the story behind that? Don't you need OCR
requests in these languages? The comment on the English
page mentions an OCR robot on the toolserver. Really?
Have been removed in June 2009: