Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

17 Jul 2013


I'm forwarding this message by George Orwell III on en-ws [1]. I think it
is extremely important as it offers an insight about what is wrong with
DjVu handling on Wikisource.
"We/you are losing the X-min, Y-min, X-Max & Y-max (mapping coordinates)
because the original PHP contributing a-hole for the DjVu routine on our
servers never bothered to finish the part where the internal DjVu text
layer is converted to a (coordinate rich) XML file using the existing
DjVuLibre software package because, at the time, the software had issues.
"That faulty DjVuLibre version was the equivalent of 4,317 versions ago and
the issue has been long fixed now EXCEPT that the .DTD file needed to base
the plain-text to XML conversion on still has the wrong 'folder path' on
local DjVuLibre installs (if this is true on server installs as well, I
cannot say for sure). Once I copied the folder to the [wrong] folder path,
I was able to generate the XMLs all day long. These XMLs are just like the
ones IA generates during their process (in addition to the XML that ABBYY
generates for them).
"So its not that we as a community decided not to follow through with
(coordinate rich) XML generation but got stuck with the plain-text dump
workaround due to a DjVuLibre problem that no longer exists. Plus, the guy
who created the beginnings of this fabulous disaster was like a tick with an
attention span deficit and moved on to conjuring up some other blasted
thing or another instead of following up on his own workaround & finishing the
XML coding portion once the DjVuLibre glitch was fixed. -- 15:16, 15 July 2013
(UTC)
[1]
http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
On Wed, Jul 17, 2013 at 6:57 AM, Alex Brollo alex.brollo@gmail.com wrote:
...
Just a brief comment about the djvu text layer, using IA files to dig
deeper into the topic.
FineReader OCR stores incredibly detailed information in a proprietary
format; then, various FineReader versions export part of this
extremely rich set of information into different outputs, one of them
being the djvu text layer. It is worth noting that even if any information
stored in the djvu text layer can be extracted and used, the set of
information wrapped into the djvu text layer (whether in lisp-like format
or in xml format) is only a minor subset of the original OCR information.
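To give an idea of what the lisp-like form of the text layer carries, here is a minimal Python sketch that pulls words and their bounding boxes out of such a dump. It assumes entries of the shape (word xmin ymin xmax ymax "text"), as printed by DjVuLibre's djvused print-txt; the sample text and the function name parse_words are made up for illustration, and escape sequences inside the quoted strings are left undecoded.

```python
import re

# One word entry of the lisp-like dump: (word xmin ymin xmax ymax "text").
# Assumption: coordinates are non-negative integers and the string is
# double-quoted with backslash escapes (left undecoded here).
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"\)'
)

def parse_words(dump):
    """Return a list of (text, (xmin, ymin, xmax, ymax)) tuples."""
    words = []
    for m in WORD_RE.finditer(dump):
        xmin, ymin, xmax, ymax = (int(g) for g in m.groups()[:4])
        words.append((m.group(5), (xmin, ymin, xmax, ymax)))
    return words

# Hypothetical sample, not taken from a real file.
SAMPLE = '''(page 0 0 2550 3300
 (line 300 3000 900 3060
  (word 300 3000 520 3060 "Hello")
  (word 560 3000 900 3060 "world")))'''

if __name__ == "__main__":
    for text, box in parse_words(SAMPLE):
        print(text, box)
```

A regular expression is enough here because only the innermost word entries are wanted; recovering the page/line nesting would need a real s-expression parser.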
If someone is interested in getting much more information, they can find it
in the abbyy.xml output; Internet Archive offers it as abbyy.gz in the list
of downloadable files. It's a very heavy and complex xml structure, but it is
possible to parse it and to extract from it any information wrapped into the
djvu text layer and much more - most interestingly wordPenalty, that is,
word by word, a summary of the degree of uncertainty of the OCR recognition
of the whole word.
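A rough Python sketch of that kind of parsing follows. It is only an illustration under stated assumptions: the sample is a heavily simplified, namespace-free stand-in for abbyy.xml, with one charParams element per character; the wordStart attribute and the exact placement of wordPenalty are assumptions about the schema, and words_with_penalty is a hypothetical helper name. Real files are gzip-compressed and use an XML namespace, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for a fragment of abbyy.xml.
SAMPLE = """<document><page><block><text><par><line><formatting>
<charParams l="10" t="5" r="20" b="30" wordStart="true" wordPenalty="0">T</charParams>
<charParams l="21" t="5" r="30" b="30" wordStart="false" wordPenalty="0">o</charParams>
<charParams l="31" t="5" r="35" b="30" wordStart="false"> </charParams>
<charParams l="36" t="5" r="46" b="30" wordStart="true" wordPenalty="36">b</charParams>
<charParams l="47" t="5" r="57" b="30" wordStart="false" wordPenalty="36">c</charParams>
</formatting></line></par></text></block></page></document>"""

def words_with_penalty(xml_text):
    """Group characters into words; return a list of (word, penalty) pairs."""
    root = ET.fromstring(xml_text)
    words = []
    chars, penalty = [], 0

    def flush():
        # Close the word collected so far, if any.
        nonlocal chars, penalty
        if chars:
            words.append(("".join(chars), penalty))
        chars, penalty = [], 0

    for cp in root.iter("charParams"):
        ch = cp.text or ""
        if ch == "" or ch.isspace():
            flush()          # whitespace ends the current word
            continue
        if cp.get("wordStart") == "true":
            flush()          # an explicit word start also ends the previous word
        chars.append(ch)
        # Keep the highest penalty seen on any character of the word.
        penalty = max(penalty, int(cp.get("wordPenalty", "0")))
    flush()
    return words

if __name__ == "__main__":
    print(words_with_penalty(SAMPLE))
```

The bounding-box attributes (l, t, r, b) are ignored here; they could be collected in the same loop if the coordinates are needed too.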
We (Aarti and I) are digging into this mess, with fast preliminary
results; you can see in [[it:w:Utente:Alex brollo/Sandbox]] some brief
pieces of text extracted from abbyy.gz, where doubtful words (in the
opinion of the OCR software) are red. They can be easily managed by
VisualEditor, since they come from a simple span tag.
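As a sketch of how such marking could be produced, the Python below wraps doubtful words in a red span. It assumes the words are already available as (text, penalty) pairs; the threshold value, the inline style and the function name mark_doubtful are illustrative choices, not what the sandbox page actually uses.

```python
from html import escape

def mark_doubtful(words, threshold=1):
    """Join words with spaces, wrapping those at or above threshold in a red span.

    words is a list of (text, penalty) pairs; threshold is a hypothetical cut-off.
    """
    parts = []
    for text, penalty in words:
        safe = escape(text)  # avoid breaking the markup on <, > or &
        if penalty >= threshold:
            parts.append('<span style="color:red">' + safe + '</span>')
        else:
            parts.append(safe)
    return " ".join(parts)

if __name__ == "__main__":
    # Made-up sample pairs.
    print(mark_doubtful([("To", 0), ("bc", 36), ("or", 0)]))
```

A class attribute instead of an inline style would make the marking easier to remove by bot once a page has been proofread.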
Now I'm waiting for Aarti's work; as soon as a VisualEditor for nsPage is
running, it will be possible to extract text by bot from abbyy.gz (if the work
comes from IA) and to upload such text as OCR.
Alex
2013/7/16 David Cuenca dacuetu@gmail.com
...
Hi Aubrey,
Thanks for the heads-up, I have CC'ed Sébastien from fr-ws, he worked on
the djvu text extraction/merging and he was interested in following-up on
that. Maybe he has some fresh ideas about it.
Micru
On Tue, Jul 16, 2013 at 10:24 AM, Andrea Zanni zanni.andrea84@gmail.com wrote:
...
Hi David, Aarti, Thibaud and Tpt,
please look at this thread:
http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
especially the last message.
It seems George Orwell III knows his stuff about DjVu and the Proofread
extension,
and it's probably worth digging into this djvu "text layer" thing.
Even if I might dream of an ideal solution (a "layered structure" for
Wikisource, in which text can be marked up several times in different layers),
that is probably very far away.
But it's still important to pave the way for further improvements, I
guess:
losing all the information from a formatted, mapped IA djvu is not a
good thing to do, IMHO.
And the Visual Editor could help us, in the future, to keep some of that
information (italics, bold, etc.)
I know Aarti spoke with Alex about abbyy.xml: is it possible to do
something with it?
Aubrey
--
Etiamsi omnes, ego non
_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
