Re: [Wikisource-l] Proofreading based on statistics

24 May 2013


      Lars Aronsson, 24/05/2013 01:54:
...
It should be possible, in any language of Wikisource, to
check all existing text against
What do you define as existing text? Only the text currently stored in 
wiki pages? Also the text layer of the DjVu or PDF files in use on the 
wiki? Also the files uploaded but not used yet?
...
a known dictionary valid
for that year, and to find words that are outside the
dictionary. These words could be proofread in some tool
similar to a CAPTCHA. They might be uncommon place names
that are correctly OCRed but not in the dictionary, or
they could be OCR errors, or both.
Has anybody tried this?
In a way: 
https://www.mediawiki.org/wiki/CAPTCHA#A_homegrown_reCAPTCHA_clone
...
Such finds are not necessarily the only OCR errors.
Some OCR errors result in correctly spelled words that
are found in the dictionary, e.g. burn -> bum.
So full manual proofreading and validation will still be
needed. But a statistics based approach could fill gaps
and quickly improve full text searchability.
True. Listing tasks to direct people to is also always a good thing on 
wikis, better than leaving people to spend time finding what to do.
Nemo
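
For illustration only, the dictionary check discussed above could be sketched roughly as follows. This is not a tool that anyone in the thread describes. The word list, the sample text and the function name out_of_dictionary are all hypothetical, and a real run would load a full period dictionary and read page text from the wiki.

```python
import re
from collections import Counter

# Hypothetical stand-in for a period dictionary; a real one would be
# loaded from a word list appropriate to the language and year.
DICTIONARY = {"the", "fire", "will", "burn", "bum", "all", "night", "in"}

# Runs of letters only (Unicode-aware): excludes digits and underscores.
WORD_RE = re.compile(r"[^\W\d_]+")


def out_of_dictionary(text, dictionary):
    """Count the words in `text` that are missing from `dictionary`.

    Matching is case-insensitive; the result maps each unknown word
    (lowercased) to the number of times it occurs.
    """
    words = (w.lower() for w in WORD_RE.findall(text))
    return Counter(w for w in words if w not in dictionary)


# Made-up sample page: "flre" is an OCR error, "Uppsala" is a correct
# place name missing from the dictionary, "bum" is an OCR error for "burn".
page = "The fire will bum all night in Uppsala. The flre will burn."
suspects = out_of_dictionary(page, DICTIONARY)

for word, count in suspects.most_common():
    print(word, count)
```

In this sketch both "flre" and "uppsala" are flagged, so a human still has to tell the OCR error from the valid place name. "bum" is not flagged, because it is itself a dictionary word, which is the burn -> bum case mentioned above.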
