[Wikisource-l] Fraktur OCR with Tesseract

15 Apr 2019


I've been scanning books since the 1990s and thought that
OCR of Blackletter (Fraktur) was a problem that someone
else would solve, so I didn't have to. For books in normal
typography, I'm using ABBYY FineReader with great success.
Feeling comfortable with FineReader, I have not really
followed the development of the free software Tesseract.
Every time I tried it, it performed worse than FineReader.
I know FineReader can be trained to read Fraktur, but
this is a lot of work and only works for one Fraktur font
at a time. I also know there is (or has been) a special
version of FineReader that reads Fraktur, which some
library projects use.

Recently I tried Tesseract again, now in version 4.0,
and found to my surprise that it worked quite well for
Fraktur in Danish and Swedish, using the separate
configuration files dan_frak and swe-frak. (The Danish
version also reads Norwegian, which in the 19th century
was very similar to Danish.)

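For anyone who wants to reproduce this, the following is a minimal
sketch of how such a run might be scripted from Python. It assumes
the tesseract binary is on the PATH and that the Fraktur language
data named above is installed; the function names and the file name
page_0001.png are hypothetical, chosen only for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="dan_frak"):
    """Return the argument list for one Tesseract run.

    The general form is: tesseract IMAGE OUTPUTBASE -l LANG
    Tesseract appends ".txt" to OUTPUTBASE by itself.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_page(image, lang="dan_frak"):
    """Run Tesseract on one scanned page and return the recognized text.

    Assumption: the tesseract binary and the requested language data
    are installed; otherwise subprocess.run raises an error.
    """
    image = Path(image)
    # Drop only the final extension, e.g. page_0001.png -> page_0001
    output_base = image.with_suffix("")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # Tesseract wrote OUTPUTBASE.txt; build that name by appending the suffix
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling ocr_page("page_0001.png") would use the Danish Fraktur data,
and ocr_page("page_0001.png", lang="swe-frak") the Swedish one, which
makes it easy to time the two on the same page.
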
However, it doesn't work at all for Finnish text, and
reading Swedish seems to be a lot slower than Danish.

Is there anybody who knows these things and can explain
how the Swedish reading of Fraktur can be improved to
match the Danish, and how a Finnish version can be
created? I can provide quite a lot of training data
in the form of scanned books and proofread text.

Is there an active mailing list or web forum for
Fraktur issues with Tesseract?
-- 
   Lars Aronsson (lars@aronsson.se)
   Project Runeberg - free Nordic literature - http://runeberg.org/
