On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni zanni.andrea84@gmail.com wrote:
Uh, that sounds very interesting. Right now, we mainly use the OCR text from the DjVu files on the Internet Archive (that means ABBYY FineReader, which is very nice).
Yes, the output is generally good. But as far as I can tell, the archive's Open Library API does not offer a way to retrieve the OCR output programmatically, and certainly not for an arbitrary page rather than the whole item. What I'm working on requires the ability to OCR a single page on demand.
But ideally we could think of a "customizable" OCR software that gets trained language by language: that would be extremely useful for Wikisources.
(I can also imagine dividing each language by century, because languages change over time too ;-)
Indeed.
A.