Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

19 Jul 2013

I don't see the possibility of directly editing the ABBYY xml file
happening any time soon. In theory, it should be possible, since that is
somewhat similar to what Visual Editor is doing: providing a WYSIWYG
interface to edit structured data (html+rdf in VE's case). But that's a
(very) long-term plan, and its relevance is not even clear to me. In
this regard, I agree with what David and Alex said.

Still, there are two things we could do with these xml files:

* extract information beyond the raw text to do some pre-formatting
prior to the page creation: this could include paragraphs, centered
text, etc. Some good OCR/layout detection software is even able to
detect font information, like bold or italic. However, and I could be
wrong here, it seems to me that the impact of such pre-formatting would
be limited: when proofreading, most of the time is spent correcting OCR
mistakes; the formatting can be done on the go and has an almost
negligible time cost.

* import the proofread text back into the xml file. By doing so, we
would recover the position of words across the page for the proofread
text. This would allow us to provide PDFs with a curated text layer.
Such PDFs would be truly and fully searchable, which I think would be
highly valuable for bibliophiles. This task essentially requires aligning
two texts: mapping each word in the proofread text to one word in the
original ABBYY file (this is not entirely accurate, since two words are
sometimes recognized as a single word by the OCR, and vice versa). I
have a few ideas on how to properly solve this problem: it is actually
very similar (and even simpler!) to the so-called "phrase alignment"
problem found in machine translation and natural language processing and
the probabilistic models it uses could easily be adapted to our problem.
I know that some attempts have been made in the past to tackle this
problem, but I don't have a clear view of what has been tried exactly,
and how successful the attempts were. I would highly appreciate any
information you could have about this.
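
To make the alignment idea concrete, here is a minimal sketch. It is not
the probabilistic model mentioned above: it swaps in a plain diff-based
alignment from Python's standard difflib module. The data layout (a list
of (word, box) pairs for the OCR, a list of words for the proofread text)
and the names Box, union and align are assumptions made for illustration,
not the ABBYY format or any existing tool.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical box layout: (left, top, right, bottom) in page coordinates.
Box = Tuple[int, int, int, int]


def union(boxes: List[Box]) -> Box:
    """Return the smallest box covering all the given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def align(ocr: List[Tuple[str, Box]],
          proofread: List[str]) -> List[Tuple[str, Optional[Box]]]:
    """Give each proofread word the box of the OCR word(s) it replaces."""
    ocr_words = [word for word, _ in ocr]
    matcher = SequenceMatcher(a=ocr_words, b=proofread, autojunk=False)
    result: List[Tuple[str, Optional[Box]]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # OCR words with no proofread counterpart are dropped.
            continue
        if tag == "insert":
            # Proofread words with no OCR counterpart get no position.
            result.extend((word, None) for word in proofread[j1:j2])
        elif i2 - i1 == j2 - j1:
            # Same number of words on both sides: map them one to one.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result.append((proofread[j], ocr[i][1]))
        else:
            # Merged or split words: share the box covering the OCR span.
            box = union([b for _, b in ocr[i1:i2]])
            result.extend((word, box) for word in proofread[j1:j2])
    return result


ocr_line = [("To", (1, 1, 3, 2)), ("be", (5, 1, 7, 2)), ("or", (8, 1, 10, 2)),
            ("not", (11, 1, 14, 2)), ("to", (15, 1, 17, 2)),
            ("bc,", (18, 1, 21, 2))]
aligned = align(ocr_line, "To be or not to be,".split())
```

Here the corrected "be," inherits the box of the misread "bc,". When the
OCR merged or split words, every proofread word in the affected span
shares one covering box, which is crude but keeps the text searchable.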

Thibaut

On 07/17/2013 10:13 PM, David Cuenca wrote:
...
 I agree with Alex, the xml is not about getting editors to work with
 it, but to improve the output of the text. If it can be combined with
 the Visual Editor to add some pre-formatting and maybe signaling which
 words are unclear, that would be already a big improvement.

 If in addition to that, it can be used to compare proofread text with
 ocr text for remapping purposes, even better.

 Micru

 On Wed, Jul 17, 2013 at 3:26 PM, Alex Brollo <alex.brollo(a)gmail.com> wrote:

     Perhaps there's a misinterpretation - I mentioned abbyy.xml but
     with no plan to import it as is; abbyy.xml is only a
     surprising data container from which to extract anything useful to
     speed up proofreading (and formatting) - nothing more than this.

     Just an example: vertical djvu coordinates of lines can be used to
     get font-size; horizontal coordinates of lines can be used to
     recognize centered text; paragraph splitting is valuable;
     columns can be recognized; margins too; with some effort probably
     poems can pop up.
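
As an illustration of the centered-text heuristic, here is a rough
sketch. It assumes each line is reduced to a (left, top, right, bottom)
box in page coordinates and that the text block spans the widest extent
seen on the page. The thresholds and the names is_centered and classify
are hypothetical choices, not taken from abbyy.xml or any existing tool.

```python
from typing import List, Tuple

# Hypothetical line box: (left, top, right, bottom) in page coordinates.
Line = Tuple[int, int, int, int]


def is_centered(line: Line, block_left: int, block_right: int,
                tolerance: float = 0.05, min_indent: float = 0.10) -> bool:
    """Guess whether a line is centered inside the text block.

    A line counts as centered when its left and right gaps are roughly
    equal (within `tolerance` of the block width) and both gaps are
    clearly non-zero (at least `min_indent` of the block width).
    """
    width = block_right - block_left
    gap_left = line[0] - block_left
    gap_right = block_right - line[2]
    balanced = abs(gap_left - gap_right) <= tolerance * width
    indented = min(gap_left, gap_right) >= min_indent * width
    return balanced and indented


def classify(lines: List[Line]) -> List[str]:
    """Label every line of a page as 'centered' or 'body'."""
    block_left = min(line[0] for line in lines)
    block_right = max(line[2] for line in lines)
    return ["centered" if is_centered(line, block_left, block_right)
            else "body" for line in lines]


page = [(100, 0, 900, 30), (350, 40, 650, 90), (100, 100, 880, 130)]
labels = classify(page)
```

The same boxes give a crude font-size estimate: the height (bottom minus
top) of the second line above is 50, against 30 for its neighbours.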

     Far from simply importing coordinates, it's a matter of using them
     as best we can; no data, no information content.

     Alex

     2013/7/17 Lars Aronsson <lars(a)aronsson.se>

         On 07/17/2013 12:57 PM, Alex Brollo wrote:

             FineReader OCR stores incredibly detailed information
             in [...] abbyy.xml

         At the other end, Wikisource is a wiki that edits wiki text.
         Sure, you could insert the XML there and let users
         edit the XML, but that would scare more users away
         and allow for more mistakes.

         For example, if proofreading Hamlet,

           To be or not to bc, that is the question,

         anybody can easily spot "bc" and correct that.
         In the XML version,

          <word x="1" y="1">To</word>
          <word x="5" y="1">be</word>
          <word x="8" y="1">or</word>

         someone might think that "or" should be a little more
         to the right, so one user inserts a space between the
         tag '<word x="8" y="1">' and "or", while another user
         adjusts the tag to '<word x="9" y="1">'. All the tags
         make it harder to spot the OCR error "bc".

         Even if you replace manual XML editing with some
         graphic tool, you get the same ambiguity between
         adding whitespace and adjusting coordinates.

         This is a nightmare that we avoid by throwing away
         all the coordinates and just proofreading the plain text.
         It is not the perfect system, it's a compromise, in
         order to get some useful work done.

         -- 
           Lars Aronsson (lars(a)aronsson.se)
           Project Runeberg - free Nordic literature -
         http://runeberg.org/

         _______________________________________________
         Wikisource-l mailing list
         Wikisource-l(a)lists.wikimedia.org
         https://lists.wikimedia.org/mailman/listinfo/wikisource-l

 -- 
 Etiamsi omnes, ego non
