Re: [Wikisource-l] On linking Wikisource with page images

22 Jan 2008

On Jan 21, 2008 6:27 PM, Jesse Martin (Pathoschild)
pathoschild@gmail.com wrote:
...
Hello,
I'm just wondering, would it be feasible to convert wiki text (without
OCR markup) back into OCR markup? A script might strip or convert
markup, diff the original OCR text with the wiki text to determine
what goes where, and generate the markup from scratch.
You could thus cleanly convert from OCR markup to wiki markup and back
without unreadable OCR markup on the wiki, and this could also be used
to provide some other very useful features (I would love to accurately
diff an entire Wikisource text with OCR scans of different printed
documents, for example).

Getting the edge cases right would be hard to impossible. For
example, the OCR reads "the ball bounced and" on one line and
"saw it hit the floor" on the next. You fix a missing word:
"the ball bounced and I saw it hit the floor". Which line is the "I"
on in the OCR output?

The nice thing about spans is that they are invisible in the output, so
you should be able to preserve them while doing all the markup you want.
The bad thing is that they are visible while editing (though they could
be hidden), and they are a pain to fix if the OCR was very wrong.
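
To make the diff idea and its edge case concrete, here is a minimal
sketch in Python using the standard difflib module. It is not an
existing Wikisource tool: the function name align_words and the
word-level matching are assumptions chosen for illustration. It maps
each word of the corrected wiki text back to an OCR line, and returns
None for a word whose neighbours sit on different OCR lines.

```python
import difflib


def align_words(ocr_lines, wiki_text):
    """Pair each wiki word with an OCR line index, or None if ambiguous.

    Hypothetical helper for illustration only.
    """
    # Flatten the OCR lines into words, remembering each word's line.
    ocr_words, ocr_line_of = [], []
    for line_no, line in enumerate(ocr_lines):
        for word in line.split():
            ocr_words.append(word)
            ocr_line_of.append(line_no)

    wiki_words = wiki_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=wiki_words, autojunk=False)

    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words keep the line of their OCR counterpart.
            for k in range(j2 - j1):
                result.append((wiki_words[j1 + k], ocr_line_of[i1 + k]))
        elif tag == "replace":
            # Corrected words inherit a line only if every OCR word
            # they replace was on that same line.
            lines = set(ocr_line_of[i1:i2])
            line = lines.pop() if len(lines) == 1 else None
            result.extend((w, line) for w in wiki_words[j1:j2])
        elif tag == "insert":
            # Inserted words have no OCR counterpart, so look at the
            # OCR words on either side of the insertion point.
            lines = set()
            if i1 > 0:
                lines.add(ocr_line_of[i1 - 1])
            if i1 < len(ocr_words):
                lines.add(ocr_line_of[i1])
            line = lines.pop() if len(lines) == 1 else None
            result.extend((w, line) for w in wiki_words[j1:j2])
        # "delete": OCR words dropped from the wiki text yield nothing.
    return result


ocr = ["the ball bounced and", "saw it hit the floor"]
fixed = "the ball bounced and I saw it hit the floor"
for word, line in align_words(ocr, fixed):
    print(word, line)
```

With the example above, every word gets line 0 or 1 except the
inserted "I", which comes back as None: the diff alone cannot say
whether it belongs at the end of the first line or the start of the
second. A real tool would need some extra rule or a human decision
for those cases.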
