On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni <zanni.andrea84(a)gmail.com> wrote:
> uh, that sounds very interesting.
> Right now, we mainly use OCR from djvu from Internet Archive (that means
> ABBYY Finereader, which is very nice).
Yes, the output is generally good. But as far as I can tell, the archive's
Open Library API does not offer a way to retrieve the OCR output
programmatically, and certainly not for an arbitrary page rather than the
whole item. What I'm working on requires the ability to OCR a single page
on demand.
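To illustrate what I mean by on-demand single-page OCR, here is a minimal sketch using Tesseract (via pytesseract) with its per-language models. The archive.org page-image URL pattern is an assumption on my part, not a documented API, and the item identifier is hypothetical:

```python
# Sketch: OCR one page of an archive.org item on demand.
# Assumptions: the page-image URL pattern below is hypothetical;
# Pillow, pytesseract, and the Tesseract binary must be installed,
# along with the traineddata file for the chosen language.

import io
import urllib.request


def page_image_url(identifier: str, page: int) -> str:
    """Build a URL for a single page image of an item.

    The path pattern here is an assumed example, not a documented
    endpoint -- adjust it to whatever the real image service expects.
    """
    return f"https://archive.org/download/{identifier}/page/n{page}.jpg"


def ocr_page(identifier: str, page: int, lang: str = "eng") -> str:
    """Fetch one page image and OCR it with a per-language model."""
    from PIL import Image   # third-party: Pillow
    import pytesseract      # third-party: wrapper around Tesseract

    with urllib.request.urlopen(page_image_url(identifier, page)) as resp:
        img = Image.open(io.BytesIO(resp.read()))
    # lang selects the trained model, e.g. "ita" for Italian texts.
    return pytesseract.image_to_string(img, lang=lang)
```

The `lang` parameter is what makes Tesseract-style tools attractive for the per-language idea below: each Wikisource community could maintain its own trained model.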
> But ideally we could think of a "customizable" OCR software that gets
> trained language per language: that would be extremely useful for
> Wikisources.
> (I can also imagine dividing, within every language, by century,
> because languages change over time too ;-)
Indeed.
> A.
--
Asaf Bartov
Wikimedia Foundation <http://www.wikimediafoundation.org>
Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org