I completely agree with Lars. I remember, for example, an awesome tool by Alex Brollo, postOCR, a JavaScript script that automatically corrects the most common OCR errors and normalizes apostrophes. The tool is very useful and widely used, and it would benefit greatly from a per-book list of common OCR errors.
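Just to illustrate the idea (a minimal Python sketch, not Alex's actual code; the correction rules below are made-up examples), such a tool boils down to something like:

    import re

    # Hypothetical per-book rules mapping common OCR errors to their
    # corrections (illustrative pairs, not postOCR's real list).
    CORRECTIONS = [
        (r"\btbe\b", "the"),
        (r"\bwbich\b", "which"),
        (r"\bancl\b", "and"),
    ]

    def post_ocr(text):
        """Apply the correction rules, then normalize apostrophes."""
        for pattern, replacement in CORRECTIONS:
            text = re.sub(pattern, replacement, text)
        # Convert straight apostrophes to typographic ones.
        return text.replace("'", "\u2019")

    print(post_ocr("tbe dog's bone ancl tbe cat"))
    # -> "the dog’s bone and the cat"

A per-book list of errors would simply swap in a different CORRECTIONS table, which is exactly why having such a list would help.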
Moreover, a set of per-book statistics (the list of words used, their frequencies, etc.) could be very interesting for a small but skilled audience, such as digital humanists and philologists.
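Computing those statistics is cheap; a naive Python sketch (the tokenizer and file name are placeholders):

    import re
    from collections import Counter

    def word_stats(text):
        """Count word occurrences in a book's plain text (naive tokenizer)."""
        return Counter(re.findall(r"[^\W\d_]+", text.lower()))

    with open("book.txt", encoding="utf-8") as f:
        stats = word_stats(f.read())
    print(stats.most_common(10))  # the ten most frequent words
    print(len(stats))             # the book's vocabulary size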
As an example, we are collaborating right now with a philologist (a digital humanist) who puts texts on Wikisource, proofreads them with the community, and then works on them.
Aubrey
On Fri, May 24, 2013 at 1:54 AM, Lars Aronsson lars@aronsson.se wrote:
It should be possible, in any language of Wikisource, to check all existing text against a known dictionary valid for the book's year of publication, and to find words that fall outside the dictionary. These words could then be proofread in some tool similar to a CAPTCHA. They might be uncommon place names that are correctly OCRed but not in the dictionary, or they could be OCR errors, or both.
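The core of the check is tiny. A minimal Python sketch, assuming a plain one-word-per-line dictionary for the book's year (both file names are hypothetical):

    import re

    # Year-appropriate wordlist, one word per line.
    with open("dictionary-1913.txt", encoding="utf-8") as f:
        dictionary = {line.strip().lower() for line in f}

    def out_of_dictionary(text):
        """Return the words in text that the dictionary does not know."""
        words = set(re.findall(r"[^\W\d_]+", text.lower()))
        return words - dictionary

    with open("page-042.txt", encoding="utf-8") as f:
        suspects = out_of_dictionary(f.read())
    # suspects now holds the CAPTCHA candidates: rare place names,
    # genuine OCR errors, or both.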
Has anybody tried this?
Words outside the dictionary are not the only OCR errors, of course. Some OCR errors result in correctly spelled words that are found in the dictionary, e.g. burn -> bum. So full manual proofreading and validation will still be needed. But a statistics-based approach could fill gaps and quickly improve full-text searchability.
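One way to catch part of that burn -> bum class: reverse common OCR shape confusions (like rn being read as m) and flag dictionary words whose reversal is also a dictionary word. A sketch, with a made-up confusion list:

    # Illustrative OCR shape confusions: (misread, original).
    CONFUSIONS = [("m", "rn"), ("cl", "d"), ("li", "h")]

    def suspicious(word, dictionary):
        """Dictionary words that `word` could have been misread from."""
        candidates = set()
        for wrong, right in CONFUSIONS:
            for i in range(len(word)):
                if word.startswith(wrong, i):
                    alt = word[:i] + right + word[i + len(wrong):]
                    if alt in dictionary:
                        candidates.add(alt)
        return candidates

    print(suspicious("bum", {"burn", "bum"}))  # -> {'burn'}

Anything flagged this way is still only a candidate, of course, and needs a human eye.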
--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/
_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l