OCR is an old problem. There is commercial software, such as Finereader, and free software such as Tesseract. But is there also a new trend in home-built software based on new frameworks for neural networks and deep learning? Keras? TensorFlow? Is anybody experimenting with this for OCR of scanned books?
When I ask researchers in image processing / computer vision, they say that plain text (book) OCR "is a solved problem" that nobody researches, and all research goes into self-driving cars reading street signs. Is this true, or are there any exceptions?
On Sat, 14 Jan 2023 at 15:20, Lars Aronsson lars@aronsson.se wrote:
OCR is an old problem. There is commercial software, such as Finereader, and free software such as Tesseract. But is there also a new trend in home-built software based on new frameworks for neural networks and deep learning? Keras? TensorFlow? Is anybody experimenting with this for OCR of scanned books?
When I ask researchers in image processing / computer vision, they say that plain text (book) OCR "is a solved problem" that nobody researches, and all research goes into self-driving cars reading street signs. Is this true, or are there any exceptions?
Yes. Recent versions of Tesseract do well at OCR. Since version 4 it has used a neural-network (LSTM) recognition engine, and results have improved with each release.
Google's Cloud Vision API can be seen as an improved proprietary counterpart; it sometimes gives slightly better results.
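For anyone who wants to try this, a minimal sketch of driving the Tesseract CLI from Python (it assumes the tesseract binary is installed and on PATH; the image filename and language codes are placeholders):

```python
# Minimal sketch: calling the Tesseract CLI from Python.
# Assumes the tesseract binary is installed; "page.png" is a hypothetical scan.
import subprocess

def tesseract_cmd(image, lang="eng", psm=3):
    # Build the tesseract command line: read `image`, write plain text to
    # stdout, using the given language model and page segmentation mode.
    return ["tesseract", image, "stdout", "-l", lang, "--psm", str(psm)]

def ocr_page(image, lang="eng"):
    # Run tesseract and return the recognized text.
    result = subprocess.run(tesseract_cmd(image, lang),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example (requires a real scan and the matching traineddata installed):
#   text = ocr_page("page.png", lang="swe")
```

The `-l` flag is also how the third-party Fraktur models (e.g. `swe-frak`, `dan_frak`) would be selected, once their traineddata files are installed.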
Here is an implementation that connects Wikisource and OCR via Google Drive (I wrote this in 2015): https://github.com/tshrinivasan/OCR4wikisource
We used it on many Indic Wikisource sites around 2016-2020.
On 2023-01-15 00:26, Shrinivasan T wrote:
Yes. The recent tesseract is doing good on OCR. It uses machine learning technologies to train and giving better results with recent versions.
I find Tesseract useful for books in good print quality with near-modern spelling, but for old print (in my case Swedish and Danish blackletter, or "Fraktur" style), it performs poorly. There are some third-party tessdata files for this (swe-frak, dan_frak), but they don't do a good job.
Have you been working on training Tesseract for new fonts and languages, or have you only been using the pre-trained languages?
As proofreading progresses, year after year, we should be able to retrain the OCR software and improve its performance. But I don't hear about any such progress.
wikisource-l@lists.wikimedia.org