I've been scanning books since the 1990s and thought that
OCR of Blackletter (Fraktur) was a problem that someone
else would solve, so I didn't have to. For books in normal
typography, I use ABBYY FineReader with great success.
Feeling comfortable with FineReader, I have not really
followed the development of the free software Tesseract.
Every time I tried it, it performed worse than FineReader.
I know FineReader can be trained to read Fraktur, but
this is a lot of work and only works for one Fraktur font
at a time. I also know there is (or has been) a special
version of FineReader that reads Fraktur, which some
library projects use.
Recently I tried Tesseract again, now in version 4.0,
and found to my surprise that it worked quite well for
Fraktur in Danish and Swedish, using the separate
configuration files dan_frak and swe-frak. (The Danish
version also reads Norwegian, which in the 19th century
was very similar to Danish.)
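
For anyone who wants to reproduce this, here is a minimal sketch of how
such a run could be scripted. It assumes the tesseract executable is on
the PATH and that the language name passed in matches a traineddata file
that is actually installed. The page file names and both function names
are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang):
    """Return the argument list for one Tesseract run.

    lang must match an installed traineddata file, e.g. "dan_frak".
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_page(image_path, output_base, lang):
    """Run Tesseract on one page image and return the text file name."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang),
        check=True,
    )
    # By default Tesseract writes its plain-text result to <output_base>.txt.
    return output_base + ".txt"
```

Calling ocr_page("page_0001.tif", "page_0001", "dan_frak") would then
leave the recognized text in page_0001.txt, provided the Danish Fraktur
data is installed under that name.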
However, it doesn't work at all for Finnish text, and
reading Swedish seems to be a lot slower than reading Danish.
Does anybody who knows these things have an answer to
how the Swedish reading of Fraktur can be improved to
match the Danish, and how a Finnish version can be
created? I can provide quite a lot of training data
in the form of scanned books and proofread text.
Is there an active mailing list or web forum for
Fraktur issues with Tesseract?
Lars Aronsson (lars(a)aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/