OCR requests - Wikisource-l

21 Apr 2010

The Swedish Wikisource is copying scanned books from
various sources. You typically find a PDF or DJVU file,
containing both scanned images and raw OCR text,
upload it to Commons, and then create an Index: page
with the <pagelist/> tag.

Some of these books have pretty miserable OCR text,
perhaps because the Norwegian National Library scanned
a Swedish book with their OCR software set to Norwegian.
Somebody with an OCR program needs to run a new OCR
on these images. Fortunately, it is quite easy to
feed the PDF or DJVU file into an OCR program such
as Finereader, and use a bot to update the pages.
We now have one user on sv.wikisource doing this.
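
As a rough sketch of the bookkeeping such a bot needs, the
snippet below pairs the new OCR text with the page titles to
update. It assumes the usual "<namespace>:<file name>/<page number>"
title pattern for pages under an Index:, and that the OCR output
is already available as one string per page. The function name,
the file name and the default namespace are hypothetical (the
local namespace name differs per wiki), and the actual upload is
left to whatever bot framework you use.

```python
# Hypothetical helper: pair freshly OCRed text with the wiki page titles
# a bot would update. No network access; saving the pages is left to the bot.

def plan_page_updates(file_name, ocr_pages, namespace="Page"):
    """Return (title, text) pairs, one per scanned page.

    ocr_pages is a list of strings, one per page, in scan order.
    Titles follow the assumed "<namespace>:<file name>/<page number>"
    pattern, with page numbers starting at 1.
    """
    updates = []
    for number, text in enumerate(ocr_pages, start=1):
        title = f"{namespace}:{file_name}/{number}"
        # Trim stray whitespace from the OCR output, end with one newline.
        updates.append((title, text.strip() + "\n"))
    return updates


if __name__ == "__main__":
    # Made-up file name and page texts, for illustration only.
    pages = ["First page", "Second page"]
    for title, text in plan_page_updates("Example.djvu", pages):
        print(title, len(text))
```

Keeping this step free of network calls makes it easy to check
the title/text pairs by eye before letting a bot save anything.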

For these Index: pages, I created the category OCR-kö
(meaning: queue of OCR requests). When I tried to add
interwiki links, I found a similar category on de.wikisource,
but the similar categories on fr, en, and pt had been removed.

What's the story behind that? Don't you need OCR
requests in these languages? The comment on the English
page mentions an OCR robot on the toolserver. Really?

These exist:

http://de.wikisource.org/wiki/Kategorie:OCR-Anfragen
http://sv.wikisource.org/wiki/Kategori:OCR-k%C3%B6

These were removed in June 2009:

http://en.wikisource.org/wiki/Category:OCR_Requests
http://fr.wikisource.org/wiki/Cat%C3%A9gorie:Demandes_d%27OCR
http://pt.wikisource.org/wiki/Categoria:!Pedidos_de_OCR

-- 
  Lars Aronsson (lars(a)aronsson.se)
  Aronsson Datateknik - http://aronsson.se