I don't know how it works inside Wikisource, but at the very least Tesseract reports a confidence value (also called a confidence score or level) for each word it recognizes, and it can report it at the character level as well. To use it for assessing quality you normally need the hOCR output, since the plain-text output does not carry the confidence values.
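
In case it helps, here is a minimal sketch of how the per-word confidence could be read back from an hOCR file and turned into a page-level score. It assumes the hOCR comes from Tesseract (for example `tesseract page.png out hocr`), where each word is a `span` with class `ocrx_word` and an `x_wconf` value in its `title` attribute. The function names, the sample text, and the threshold of 60 are my own choices for illustration, not anything Wikisource or Tesseract prescribes.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collects (text, confidence) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (word text, x_wconf as float)
        self._conf = None  # confidence of the word span currently open
        self._text = []    # text fragments of the word span currently open
        self._depth = 0    # nested <span> depth inside the open word span

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self._conf is not None:
            # Already inside a word: only track nested spans.
            if tag == "span":
                self._depth += 1
            return
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"x_wconf\s+(\d+(?:\.\d+)?)",
                              attributes.get("title") or "")
            if match:
                self._conf = float(match.group(1))
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._conf is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._conf is None or tag != "span":
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._conf))
        self._conf = None


def parse_hocr_words(hocr):
    """Return a list of (word, confidence) pairs found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def mean_confidence(hocr):
    """Average word confidence for a page, or None if no words were found."""
    words = parse_hocr_words(hocr)
    if not words:
        return None
    return sum(conf for _, conf in words) / len(words)


def low_confidence_words(hocr, threshold=60):
    """Words whose confidence is below the (arbitrary) threshold."""
    return [(w, c) for w, c in parse_hocr_words(hocr) if c < threshold]


# Hand-written sample in the shape of Tesseract's hOCR output (not real OCR data).
SAMPLE = """<div class='ocr_page' title='bbox 0 0 100 100'>
<span class='ocrx_word' id='word_1_1' title='bbox 1 1 10 10; x_wconf 96'>Project</span>
<span class='ocrx_word' id='word_1_2' title='bbox 11 1 20 10; x_wconf 41'>Runcberg</span>
<span class='ocrx_word' id='word_1_3' title='bbox 21 1 30 10; x_wconf 88'><strong>free</strong></span>
</div>"""
```

Sorting pages by `mean_confidence`, or by the share of words returned by `low_confidence_words`, would give one rough way to rank which pages need attention first. The confidence is the engine's own estimate, so it is probably best treated as a hint rather than a measured error rate.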

Cheers,

On Tue, 12 Mar 2019 at 17:27, Lars Aronsson (<lars@aronsson.se>) wrote:

If you have a large digitization project, such as Wikisource,
with many pages and books of scanned images and OCR text
(originating from different sources and times),
how do you assess the OCR quality and determine which pages
are in most need of improved OCR or proofreading?

Is spell checking (and a normal dictionary) the only useful tool?
Would you count the number of spelling errors, or the ratio
of errors to correct words? Has anyone done this?

--
Lars Aronsson (lars@aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/

_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l