Fraktur OCR with Tesseract - Wikisource-l

15 Apr 2019

I've been scanning books since the 1990s and thought that
OCR of Blackletter (Fraktur) was a problem that someone
else would solve, so I didn't have to. For books in normal
typography, I'm using ABBYY Finereader with great success.
Feeling comfortable with Finereader, I have not really
followed the development of the free software Tesseract.
Every time I tried it, it performed worse than Finereader.

I know Finereader can be trained to read Fraktur, but
this is a lot of work and only works for one Fraktur font
at a time. I also know there is (or has been) a special
version of Finereader that reads Fraktur, which some
library projects use.

Recently I tried Tesseract again, now in version 4.0,
and found to my surprise that it worked quite well for
Fraktur in Danish and Swedish, using the separate
configuration files dan_frak and swe-frak. (The Danish
version also reads Norwegian, which in the 19th century
was very similar to Danish.)
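
For anyone who wants to reproduce this, here is a minimal
sketch of the kind of invocation I mean. It assumes the
tesseract binary is on the PATH and that the Fraktur data
files are installed under the names mentioned above. The
helper names and the file name page.png are made up for
illustration.

```python
import subprocess


def build_command(image_path, lang):
    """Build a tesseract command line that prints recognized text to stdout."""
    # Passing "stdout" as the output base makes tesseract write the text
    # to standard output instead of creating a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_page(image_path, lang):
    """Run tesseract on one scanned page and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Calling ocr_page("page.png", "dan_frak") would then return
the text of a Danish page, provided a data file with that
name is actually installed.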

However, it doesn't work at all for Finnish text, and
reading Swedish seems to be a lot slower than reading
Danish.
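
To put a number on the speed difference, something like
the following rough sketch could be used. It simply
measures wall-clock time per page. The function names and
the mapping from language name to page image are my own
assumptions, and a fair comparison would need pages of
similar size and print quality.

```python
import subprocess
import time


def time_command(cmd):
    """Return the wall-clock seconds taken to run cmd to completion."""
    start = time.perf_counter()
    subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,  # the recognized text is not needed here
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return time.perf_counter() - start


def compare_languages(pages):
    """Time tesseract on one page per language.

    pages maps a language name (for example "dan_frak") to the path of
    a scanned page in that language; both are supplied by the caller.
    """
    return {
        lang: time_command(["tesseract", image_path, "stdout", "-l", lang])
        for lang, image_path in pages.items()
    }
```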

Does anybody here know how the Swedish reading of Fraktur
can be improved to match the Danish, and how a Finnish
version can be created? I can provide quite a lot of
training data in the form of scanned books and proofread
text.

Is there an active mailing list or web forum for
Fraktur issues with Tesseract?

-- 
   Lars Aronsson (lars(a)aronsson.se)
   Project Runeberg - free Nordic literature - http://runeberg.org/