On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni <zanni.andrea84@gmail.com> wrote:
uh, that sounds very interesting.
Right now, we mainly use the OCR embedded in the DjVu files from the Internet Archive (that means ABBYY FineReader, which is very good).

Yes, the output is generally good.  But as far as I can tell, the archive's Open Library API does not offer a way to retrieve the OCR output programmatically, and certainly not for an arbitrary page rather than the whole item.  What I'm working on requires the ability to OCR a single page on demand.
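For concreteness, here is a minimal Python sketch of the kind of on-demand, single-page OCR I mean. The page-image URL scheme is purely illustrative (not something the Archive documents for this purpose), and Tesseract via pytesseract stands in for whichever engine we would actually use:

```python
# Sketch: OCR a single page of an Internet Archive item on demand.
# The URL scheme below is hypothetical/illustrative, and Tesseract
# (via pytesseract) is just a stand-in OCR engine for the example.
import io


def page_image_url(identifier: str, page: int) -> str:
    """Build an image URL for one page of an item (hypothetical scheme)."""
    return ("https://iiif.archive.org/iiif/"
            f"{identifier}${page}/full/full/0/default.jpg")


def ocr_page(identifier: str, page: int, lang: str = "eng") -> str:
    """Fetch one page image and OCR it.

    Requires the third-party packages requests, Pillow, and pytesseract,
    plus a local Tesseract install and network access.
    """
    import requests            # third-party HTTP client
    from PIL import Image      # Pillow
    import pytesseract         # wrapper around the Tesseract CLI

    resp = requests.get(page_image_url(identifier, page), timeout=30)
    resp.raise_for_status()
    image = Image.open(io.BytesIO(resp.content))
    # `lang` selects a per-language traineddata file, e.g. "ita" for Italian.
    return pytesseract.image_to_string(image, lang=lang)
```

Incidentally, Tesseract's `lang` parameter is also where per-language training would plug in: each trained model is just another traineddata file the engine can load.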

But ideally we could think of a "customizable" OCR software that gets trained language by language: that would be extremely useful for the Wikisources.

(I can also imagine dividing, within every language, by century, because languages change over time too ;-)

Indeed.

   A.
--
    Asaf Bartov
    Wikimedia Foundation

Imagine a world in which every single human being can freely share in the sum of all knowledge. Help us make it a reality!