If you have a large digitization project, such as Wikisource, with many pages and books of scanned images and OCR text (originating from different sources and times), how do you assess the OCR quality and determine which pages are in most need of improved OCR or proofreading?
Is spell checking (and a normal dictionary) the only useful tool? Would you count the number of spelling errors, or the ratio of errors to correct words? Has anyone done this?
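Concretely, the ratio metric could look like this minimal sketch, assuming a plain word list such as /usr/share/dict/words (the path and the crude tokenizer are illustrative assumptions, not anything Wikisource actually uses):

    # Share of tokens missing from a plain dictionary, as a rough
    # per-page OCR quality score (higher = worse).
    import re

    def load_wordlist(path="/usr/share/dict/words"):
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f}

    def error_ratio(text, words):
        tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters
        if not tokens:
            return 0.0
        misses = sum(1 for t in tokens if t.lower() not in words)
        return misses / len(tokens)

    words = load_wordlist()
    print(round(error_ratio("Thc quick brown fox", words), 2))  # 0.25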
I don't know how it works inside Wikisource, but at the very least Tesseract reports a confidence value (also called a confidence score or level) for how well it recognized each word (it works at the character level too). To get at those values, though, you normally need the hOCR output.
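For what it's worth, a hedged sketch of pulling those per-word values out of the hOCR (assuming pytesseract and Pillow are installed; the file name is made up):

    # Extract Tesseract's per-word confidences (x_wconf) from hOCR output.
    import re
    import pytesseract
    from PIL import Image

    def word_confidences(image_path):
        hocr = pytesseract.image_to_pdf_or_hocr(
            Image.open(image_path), extension="hocr"
        ).decode("utf-8")
        # Each ocrx_word span carries its confidence in the title
        # attribute, e.g. title="bbox 36 92 96 116; x_wconf 93"
        return [int(c) for c in re.findall(r"x_wconf (\d+)", hocr)]

    confs = word_confidences("page_001.png")  # hypothetical file name
    if confs:
        print(len(confs), "words, mean confidence",
              round(sum(confs) / len(confs), 1))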
cheers,
Lars Aronsson, 12/03/19 22:27:
Is spell checking (and a normal dictionary) the only useful tool?
I'm not sure it's the only or most useful, but it's definitely common.
Would you count the number of spelling errors, or the ratio of errors to correct words? Has anyone done this?
It's routinely done by OCR software, and in fact, if I remember correctly, such information is stored in DjVu files (the uncertain words are marked).
In practice, such information is most useful when preparing the files for upload to Wikisource: you can minimise your manual work by checking that the OCR was mostly successful and, if it wasn't, retrying with different settings. I thought we had such guidance in https://en.wikisource.org/wiki/Category:File_creation_help but it seems not.
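As a non-authoritative sketch of that triage step (reusing the error_ratio helper and word list from the sketch further up; the directory layout and the 0.10 threshold are assumptions):

    # Rank pages by error ratio so the worst ones get re-OCRed or
    # proofread first. Assumes one plain-text file per page.
    from pathlib import Path

    def rank_pages(text_dir, words, worst_n=20):
        scores = [(error_ratio(p.read_text(encoding="utf-8"), words), p.name)
                  for p in Path(text_dir).glob("*.txt")]
        return sorted(scores, reverse=True)[:worst_n]

    for ratio, name in rank_pages("ocr_pages", words):
        flag = "re-OCR?" if ratio > 0.10 else ""
        print(f"{ratio:.2f}  {name}  {flag}")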
We discussed this long ago, but I don't remember when. https://strategy.wikimedia.org/wiki/Proposal:Make_Wikisource_scale reminds me that https://www.nla.gov.au/australian-newspaper-plan ran an impressive crowdsourcing effort a decade ago, but I can't tell whether it has since died.
OCR assessment is a well-researched issue, so you'll find things like https://www.digitisation.eu/glossary/ground-truth/ but not so much about how to organise a transcription project: http://succeed-project.eu/wiki/index.php/TPDL_Tutorial_State-of-the-art_tools_for_text_digitisation#2._OCR_and_Post-correction claims that VTL was the most promising method back then, but in 2015 we found it had shut down in 2014. https://lists.wikimedia.org/pipermail/wikisource-l/2015-October/002516.html
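For reference, evaluation against ground truth usually boils down to a character error rate (edit distance over reference length); a self-contained sketch with made-up sample strings:

    # Character error rate (CER): Levenshtein distance between the OCR
    # text and a ground-truth transcription, divided by the truth's length.
    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def cer(ocr_text, ground_truth):
        return edit_distance(ocr_text, ground_truth) / max(len(ground_truth), 1)

    print(round(cer("Thc quick hrown fox", "The quick brown fox"), 3))  # 0.105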
Federico