Re: [Wikisource-l] Proofreading based on statistics

24 May 2013


As a user, I explored the Distributed Proofreaders website to pick up
ideas about proofreading. It was a very productive and enlightening
experience, even though the whole philosophy of DP proofreading/formatting is
completely different from, and incompatible with, the wiki approach. One of
its tools is an excellent customizable, js-based spelling dictionary. How much
I would like something like that in Wikisource! Obviously we need an excellent,
very easily customizable tool, ideally a "book-specific spelling tool". I tried
to think about it, but there are lots of difficulties. The first is that
it's difficult to highlight words in a textarea with js. It may be that
VisualEditor could make things easier.
Alex
2013/5/24 Andrea Zanni zanni.andrea84@gmail.com
...
I completely agree with Lars.
I remember, for example, an awesome tool from Alex Brollo, postOCR,
a js script which automatically corrects the most common OCR errors and
converts apostrophes.
The tool is very useful and widely used, and it would benefit a lot from
a list of common OCR errors for each book.
Moreover, a set of stats per book
(list of words used, counts of those words, etc.)
could be very interesting for a small but skilled group of readers,
such as digital humanists and philologists.
As an example, we are collaborating right now with a philologist (a
digital humanist)
who puts texts on Wikisource, proofreads them with the community,
and then works on them.
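As an illustration only, the per-book stats mentioned above (list of words
used, counts of those words) could be sketched in Python roughly like this.
The function name word_stats and the sample sentence are made up for the
example; this is not how postOCR or any existing Wikisource tool works.

```python
import re
from collections import Counter

# Runs of letters (any script), optionally joined by an apostrophe.
# Digits, underscores and hyphens are treated as word separators.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def word_stats(text):
    """Return a Counter mapping each lower-cased word to its frequency."""
    return Counter(m.group(0).lower() for m in WORD_RE.finditer(text))


# Hypothetical sample standing in for the full text of one book.
sample = "The burn ran past the mill. The mill-wheel turned."
stats = word_stats(sample)

# Print the most frequent words first.
for word, count in stats.most_common(3):
    print(word, count)
```

Words that occur only once in a whole book are often good candidates for a
second look, since an OCR error rarely repeats in exactly the same form.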
Aubrey
On Fri, May 24, 2013 at 1:54 AM, Lars Aronsson lars@aronsson.se wrote:
...
It should be possible, in any language of Wikisource, to
check all existing text against a known dictionary valid
for that year, and to find words that are outside the
dictionary. These words could be proofread in some tool
similar to a CAPTCHA. They might be uncommon place names
that are correctly OCRed but not in the dictionary, or
they could be OCR errors, or both.
Has anybody tried this?
Such finds are not necessarily the only OCR errors.
Some OCR errors result in correctly spelled words that
are found in the dictionary, e.g. burn -> bum.
So full manual proofreading and validation will still be
needed. But a statistics-based approach could fill gaps
and quickly improve full-text searchability.
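A rough Python sketch of the dictionary check described above, under some
loud assumptions: the dictionary is just a set of lower-cased words, the
pages are plain strings keyed by page number, and the names unknown_words,
dictionary and pages are invented for this example. A real word list valid
for the year of the book would have to come from somewhere else.

```python
import re
from collections import Counter

# Runs of letters in any script; everything else separates words.
WORD_RE = re.compile(r"[^\W\d_]+")


def unknown_words(pages, dictionary):
    """Find words that are not in the dictionary.

    Returns (counts, where): how often each unknown word occurs,
    and the set of page numbers on which it was seen.
    """
    counts = Counter()
    where = {}
    for page_no, text in pages.items():
        for match in WORD_RE.finditer(text):
            word = match.group(0).lower()
            if word not in dictionary:
                counts[word] += 1
                where.setdefault(word, set()).add(page_no)
    return counts, where


# Hypothetical toy data: a tiny dictionary and two OCRed pages.
dictionary = {"the", "fire", "will", "burn", "bum", "in", "town", "of"}
pages = {
    1: "The fire will bum in the town of Ystad",
    2: "The flre will burn",
}
counts, where = unknown_words(pages, dictionary)

for word, count in counts.most_common():
    print(word, count, sorted(where[word]))
```

In this toy run the place name "Ystad" and the OCR error "flre" are both
flagged and would need a human to tell them apart, while "bum" passes
unnoticed because it is a valid dictionary word.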
--
  Lars Aronsson (lars@aronsson.se)
  Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/

Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
