Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

19 Jul 2013

I don't see the possibility of directly editing the ABBYY xml file
happening any time soon. In theory, it should be possible, since that is
somewhat similar to what Visual Editor is doing: providing a WYSIWYG
interface to edit structured data (html+rdf in VE's case). But that's a
(very) long-term plan, and its relevance is not even clear to me. In
this regard, I agree with what David and Alex said.
Still, there are two things we could do with these xml files:
* extract information beyond the raw text to do some pre-formatting
prior to the page creation: this could include paragraphs, centered
text, etc. Some good OCR/layout detection software is even able to
detect font information, like bold or italic. However, and I could be
wrong here, it seems to me that the impact of such pre-formatting would
be limited: when proofreading, most of the time is spent correcting OCR
mistakes; the formatting can be done on the go and has an almost
negligible time cost.
* import the proofread text back into the xml file. By doing so, we
would recover the position of words across the page for the proofread
text. This would allow us to provide PDFs with a curated text layer.
Such PDFs would be truly and fully searchable, which I think would be
highly valuable for bibliophiles. This task requires aligning
two texts: mapping each word in the proofread text to one word in the
original ABBYY file (this is not entirely accurate, since two words are
sometimes recognized as a single word by the OCR, and vice versa). I
have a few ideas on how to properly solve this problem: it is actually
very similar (and even simpler!) to the so-called "phrase alignment"
problem found in machine translation and natural language processing and
the probabilistic models it uses could easily be adapted to our problem.
I know that some attempts have been made in the past to tackle this
problem, but I don't have a clear view of what has been tried exactly,
and how successful the attempts were. I would highly appreciate any
information you could have about this.
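
As a rough illustration of the alignment step described above, here is a
minimal sketch in Python. It does not use the probabilistic models
mentioned above; it swaps in the standard library's
difflib.SequenceMatcher as a simple stand-in. The function name
align_words and the (text, box) input format are assumptions made for
this example, not part of any existing tool.

```python
from difflib import SequenceMatcher


def align_words(ocr_words, proofread_words):
    """Attach OCR boxes to proofread words.

    ocr_words: list of (text, box) pairs; the box format is hypothetical
    and is passed through untouched.
    proofread_words: list of strings.
    Returns a list of (proofread_word, [boxes]) pairs.
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        boxes = [box for _, box in ocr_words[i1:i2]]
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            # Same number of words on both sides: pair them one to one.
            # This covers identical words and plain misreadings ("bc" / "be").
            for k in range(j2 - j1):
                result.append((proofread_words[j1 + k], [boxes[k]]))
        else:
            # Merged or split words, or words present on one side only:
            # every proofread word in the span gets all OCR boxes of the span
            # (an empty list if the OCR has nothing there).
            for word in proofread_words[j1:j2]:
                result.append((word, list(boxes)))
    return result


if __name__ == "__main__":
    ocr = [("To", (1, 1)), ("be", (5, 1)), ("or", (8, 1)), ("bc,", (11, 1))]
    for word, boxes in align_words(ocr, ["To", "be", "or", "be,"]):
        print(word, boxes)
```

This only matches words that are exactly equal, so a page with many OCR
errors in a row would produce coarse spans; a character-level or
probabilistic model would presumably do better there.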
Thibaut
On 07/17/2013 10:13 PM, David Cuenca wrote:
...
I agree with Alex, the xml is not about getting editors to work with
it, but to improve the output of the text. If it can be combined with
the Visual Editor to add some pre-formatting and maybe signaling which
words are unclear, that would already be a big improvement.
If in addition to that, it can be used to compare proofread text with
OCR text for remapping purposes, even better.
Micru
On Wed, Jul 17, 2013 at 3:26 PM, Alex Brollo <alex.brollo@gmail.com> wrote:
Perhaps there's a misinterpretation - I mentioned abbyy.xml but
with no project to import it as-is; abbyy.xml is only a
surprising data container from which to extract anything useful to
speed up proofreading (and formatting) - nothing more than this.

Just an example: vertical djvu coordinates of lines can be used to
get font size; horizontal coordinates of lines can be used to
recognize centered text; paragraph splitting is valuable;
columns can be recognized; margins too; with some effort probably
poems can pop up.
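
A minimal sketch of the idea above, in Python. It assumes the line
coordinates have already been extracted into plain numbers; the function
names, the tolerance value and the input layout are hypothetical choices
for this example and do not come from the djvu or abbyy.xml formats.

```python
from statistics import median


def classify_line(left, right, column_left, column_right, tolerance=0.03):
    """Guess a line's alignment from its horizontal extent.

    All coordinates share one unit (e.g. pixels). tolerance is a fraction
    of the text column width and is an arbitrary starting value.
    """
    slack = tolerance * (column_right - column_left)
    left_gap = left - column_left
    right_gap = column_right - right
    if left_gap <= slack and right_gap <= slack:
        return "justified"   # line fills the whole column
    if abs(left_gap - right_gap) <= slack:
        return "centered"    # equal white space on both sides
    if left_gap <= slack:
        return "left"        # starts at the margin, ends early
    if right_gap <= slack:
        return "right"       # ends at the margin, starts late
    return "indented"


def relative_heights(lines):
    """Return each line's height divided by the page's median line height.

    lines: list of (top, bottom) pairs with bottom > top. Values well
    above 1.0 hint at a larger font, such as a heading.
    """
    heights = [bottom - top for top, bottom in lines]
    typical = median(heights)
    return [height / typical for height in heights]
```

For instance, classify_line(300, 700, 100, 900) leaves equal gaps on
both sides of the line and would be reported as centered.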

Far from simply importing coordinates, it's a matter of using them
at our best; no data, no information content.

Alex

2013/7/17 Lars Aronsson <lars@aronsson.se>

    On 07/17/2013 12:57 PM, Alex Brollo wrote:

        FineReader OCR stores incredibly detailed information
        in [...] abbyy.xml

    At the other end, Wikisource is a wiki that edits wiki text.
    Sure, you could insert the XML there and let users
    edit the XML, but that would scare more users away
    and allow for more mistakes.

    For example, if proofreading Hamlet,

      To be or not to bc, that is the question,

    anybody can easily spot "bc" and correct that.
    In the XML version,

     <word x=1 y=1>To</word>
     <word x=5 y=1>be</word>
     <word x=8 y=1>or</word>

    someone might think that "or" should be a little more
    to the right, so one user inserts a space between the
    tag "<word x=8 y=1>" and "or", while another user
    adjusts the tag to "<word x=9 y=1>". All the tags
    make it harder to spot the OCR error "bc".

    Even if you replace manual XML editing with some
    graphic tool, you get the same ambiguity between
    adding whitespace and adjusting coordinates.

    This is a nightmare that we avoid by throwing away
    all the coordinates and just proofreading the plain text.
    It is not the perfect system, it's a compromise, in
    order to get some useful work done.

    -- 
      Lars Aronsson (lars@aronsson.se)
      Project Runeberg - free Nordic literature -
    http://runeberg.org/

    _______________________________________________
    Wikisource-l mailing list
    Wikisource-l@lists.wikimedia.org
    https://lists.wikimedia.org/mailman/listinfo/wikisource-l

-- 
Etiamsi omnes, ego non
