Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

19 Jul 2013

      On Fri, Jul 19, 2013 at 8:13 AM, Thibaut Horel thibaut.horel@gmail.com wrote:
...
I don't see the possibility of directly editing the ABBYY xml file
happening any time soon. In theory, it should be possible, since that is
somewhat similar to what Visual Editor is doing: providing a WYSIWYG
interface to edit structured data (html+rdf in VE's case). But that's a
(very) long-term plan, and its relevance is not even clear to me. In this
regard, I agree with what David and Alex said.
Still, there are two things we could do with these xml files:

extract information beyond the raw text to do some pre-formatting prior
to the page creation: this could include paragraphs, centered text, etc.
Some good OCR/layout detection software can even detect font
information, like bold or italic. However, and I could be wrong here, it
seems to me that the impact of such pre-formatting would be limited: when
proofreading, most of the time is spent correcting OCR mistakes; the
formatting can be done on the go and has an almost negligible time cost.
I still think that doing most of the work automatically (if possible) would
be a good idea. I actually like formatting (e.g. bold, italics) much more
than I like proofreading OCR, but I also think that the less burden we give
our proofreaders the better it is.
I mean, if I'm proofreading a text, and I see the text is already well
formatted, it saves time: if it's formatted badly, I can still correct it,
right?
...

import the proofread text back into the xml file. By doing so, we would
recover the position of words across the page for the proofread text. This
would allow us to provide PDFs with a curated text layer. Such PDFs would
be truly and fully searchable, which I think would be highly valuable for
bibliophiles. This task requires aligning two texts: mapping each word
in the proofread text to one word in the original ABBYY file (this is not
entirely accurate since two words are sometimes recognized as a single word
by the OCR, and vice versa). I have a few ideas on how to properly solve
this problem: it is actually very similar (and even simpler!) to the
so-called "phrase alignment" problem found in machine translation and
natural language processing, and the probabilistic models it uses could
easily be adapted to our problem. I know that some attempts have been made
in the past to tackle this problem, but I don't have a clear view of what
has been tried exactly, and how successful the attempts were. I would
highly appreciate any information you could have about this.
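
To make the alignment idea above concrete, here is a minimal sketch in Python. It does not use the probabilistic phrase-alignment models mentioned above; it swaps in a plain sequence diff from the standard library (difflib) as a much simpler stand-in. The input shape (a list of (text, box) pairs for the OCR words, with boxes as (x0, y0, x1, y1) tuples) and the function names are assumptions made for illustration, not the ABBYY format or any existing tool.

```python
from difflib import SequenceMatcher


def union_box(boxes):
    """Smallest (x0, y0, x1, y1) rectangle covering all the given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def align_words(ocr_words, proofread_words):
    """Give each proofread word the box of the OCR word(s) it lines up with.

    ocr_words: list of (text, box) pairs (hypothetical input shape).
    proofread_words: list of corrected words, in reading order.
    Returns a list of (proofread_word, box or None).
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            # Same number of words on both sides: map them one to one.
            for k in range(j2 - j1):
                aligned.append((proofread_words[j1 + k], ocr_words[i1 + k][1]))
        elif tag == "replace":
            # Words were merged or split by the OCR: share the combined box.
            box = union_box([b for _, b in ocr_words[i1:i2]])
            aligned.extend((w, box) for w in proofread_words[j1:j2])
        elif tag == "insert":
            # Proofread words with no OCR counterpart: position unknown.
            aligned.extend((w, None) for w in proofread_words[j1:j2])
        # "delete": OCR words removed by the proofreader are simply dropped.
    return aligned
```

A corrected word such as "be" replacing the OCR error "bc" inherits the box of "bc", and a single OCR token that the proofreader split into two words gives both words the same box. A real solution would likely need character-level similarity to handle long runs of errors.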
I think Seb35 studied the subject a bit a few years ago, with all the
probabilistic things and Markov chains and funny stuff you all like :-)
(It always amazes me how many mathematicians or the like are involved in
Wikisource. My conclusion is that we like to put order in abstract spaces.)
Aubrey
...
Thibaut
On 07/17/2013 10:13 PM, David Cuenca wrote:
I agree with Alex, the xml is not about getting editors to work with it,
but to improve the output of the text. If it can be combined with the
Visual Editor to add some pre-formatting and maybe signaling which words
are unclear, that would be already a big improvement.
If in addition to that, it can be used to compare proofread text with ocr
text for remapping purposes, even better.
Micru
On Wed, Jul 17, 2013 at 3:26 PM, Alex Brollo alex.brollo@gmail.com wrote:
...
Perhaps there's a misinterpretation - I mentioned abbyy.xml, but with no
plan to import it as it is; abbyy.xml is only a surprising data
container from which to extract anything useful to speed up proofreading (and
formatting) - nothing more than this.
Just an example: vertical djvu coordinates of lines can be used to get
font size; horizontal coordinates of lines can be used to recognize
centered text; paragraph splitting is valuable; columns can be
recognized; margins too; with some effort, poems can probably pop up.
Far from simply importing coordinates, it's a matter of using them as
best we can; no data, no information content.
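
As a rough illustration of the heuristics above, here is a short Python sketch that flags centered lines and lines set in a larger font from line coordinates alone. The input shape (one dict per line with left, right, top, bottom and text keys), the function name and the thresholds are all assumptions chosen for the example; they are not taken from the djvu or abbyy.xml formats.

```python
def classify_lines(lines, tolerance=0.05, size_ratio=1.3):
    """Guess which lines are centered and which use a larger font.

    lines: list of dicts with "left", "right", "top", "bottom", "text"
    (hypothetical input shape; coordinates in pixels, top < bottom).
    Returns a list of (text, is_centered, is_large) tuples.
    """
    block_left = min(line["left"] for line in lines)
    block_right = max(line["right"] for line in lines)
    slack = tolerance * (block_right - block_left)

    # Line height is used as a stand-in for font size.
    heights = sorted(line["bottom"] - line["top"] for line in lines)
    median_height = heights[len(heights) // 2]

    results = []
    for line in lines:
        left_gap = line["left"] - block_left
        right_gap = block_right - line["right"]
        # Centered: clearly indented on both sides, by about the same amount.
        is_centered = (
            abs(left_gap - right_gap) <= slack
            and min(left_gap, right_gap) > slack
        )
        height = line["bottom"] - line["top"]
        is_large = height > size_ratio * median_height
        results.append((line["text"], is_centered, is_large))
    return results
```

A short last line of a paragraph is indented on the right only, so it is not reported as centered; a full-width line has no gap on either side and is excluded too.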
Alex
2013/7/17 Lars Aronsson lars@aronsson.se
...
On 07/17/2013 12:57 PM, Alex Brollo wrote:
...
FineReader OCR stores incredibly detailed information in [...]
abbyy.xml
On the other end, Wikisource is a wiki that edits wiki text.
Sure, you could insert the XML there and let users
edit the XML, but that would scare more users away
and allow for more mistakes.
For example, if proofreading Hamlet,
To be or not to bc, that is the question,
anybody can easily spot "bc" and correct that.
In the XML version,
<word x=1 y=1>To</word>
 <word x=5 y=1>be</word>
 <word x=8 y=1>or</word>
someone might think that "or" should be a little more
to the right, so one user inserts a space between the
tag "<word x=8 y=1>" and "or", while another user
adjusts the tag to "<word x=9 y=1>". All the tags
make it harder to spot the OCR error "bc".
Even if you replace manual XML editing with some
graphic tool, you get the same ambiguity between
adding whitespace and adjusting coordinates.
This is a nightmare that we avoid by throwing away
all the coordinates and just proofreading the plain text.
It is not the perfect system, it's a compromise, in
order to get some useful work done.
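
For what it is worth, "throwing away the coordinates" is mechanically simple. The Python sketch below assumes a simplified, well-formed variant of the toy markup above (quoted attributes, a wrapping page element); this is a hypothetical schema for illustration, not the real abbyy.xml format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified markup modeled on the toy example above.
SAMPLE = """
<page>
  <word x="1" y="1">To</word>
  <word x="5" y="1">be</word>
  <word x="8" y="1">or</word>
  <word x="1" y="2">not</word>
  <word x="5" y="2">to</word>
  <word x="8" y="2">bc</word>
</page>
"""


def plain_text(xml_source):
    """Drop all coordinates and return the words as plain text lines."""
    root = ET.fromstring(xml_source)
    lines = defaultdict(list)
    for word in root.iter("word"):
        x = int(word.get("x"))
        y = int(word.get("y"))
        lines[y].append((x, (word.text or "").strip()))
    # Lines are ordered top to bottom, words left to right within a line.
    return "\n".join(
        " ".join(text for _, text in sorted(words))
        for _, words in sorted(lines.items())
    )


print(plain_text(SAMPLE))
```

The result is the kind of text a proofreader actually edits, with "bc" easy to spot once the tags are gone.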
--
  Lars Aronsson (lars@aronsson.se)
  Project Runeberg - free Nordic literature - http://runeberg.org/

Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l

--
Etiamsi omnes, ego non
