[Wikisource-l] Re: OCR in 2023

15 Jan 2023


On 2023-01-15 00:26, Shrinivasan T wrote:
...
Yes. Recent versions of Tesseract do well at OCR.
They are trained with machine learning and give better
results than earlier versions.

I find Tesseract useful for books in good print quality with
near-modern spelling, but for old print (in my case Swedish and
Danish blackletter or "Fraktur" style), it performs poorly.
There are some third-party tessdata files for this (swe-frak,
dan_frak), but they don't do a good job.
Have you been working on training Tesseract for new fonts and
languages, or have you only been using the pre-trained languages?
As proofreading progresses, year after year, we should be able
to retrain the OCR software and improve its performance.
But I haven't heard of any such progress.
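
One way to make "performance" concrete is to compare raw OCR output
against the proofread text of the same page. The sketch below is only
an illustration, not something from this thread: it computes a
character error rate (edit distance divided by the length of the
proofread text). The function names and the sample strings are
hypothetical, and it assumes both texts are already available as
plain strings.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, proofread_text: str) -> float:
    """Edit distance normalised by the length of the proofread reference."""
    if not proofread_text:
        raise ValueError("proofread_text must not be empty")
    return edit_distance(ocr_text, proofread_text) / len(proofread_text)


if __name__ == "__main__":
    # Hypothetical sample: one misread character in a short Swedish phrase.
    ocr = "fri nordisk litteratnr"
    reference = "fri nordisk litteratur"
    print(f"CER: {character_error_rate(ocr, reference):.3f}")
```

Tracking this number per page, before and after any retraining, would
show whether a new model actually improves on the old one for a given
typeface.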
-- 
   Lars Aronsson (lars@aronsson.se)
   Project Runeberg - free Nordic literature - http://runeberg.org/
