I don't see the possibility of directly
editing the ABBYY xml file happening any time soon. In theory, it
should be possible, since that is somewhat similar to what Visual
Editor is doing: providing a WYSIWYG interface to edit structured
data (html+rdf in VE's case). But that's a (very) long-term plan,
and its relevance is not even clear to me. In this regard, I agree
with what David and Alex said.
Still, there are two things we could do with these xml files:
* extract information beyond the raw text to do some
pre-formatting prior to the page creation: this could include
paragraphs, centered texts etc. Some good OCR/layout detection
software can even detect font information, like bold or
italic. However, and I could be wrong here, it seems to me that
the impact of such pre-formatting would be limited: when
proofreading, most of the time is spent correcting OCR mistakes,
while formatting can be done on the go at an almost negligible
time cost.
* import the proofread text back into the xml file. By doing so,
we would recover the position of words across the page for the
proofread text. This would allow us to provide PDFs with a curated
text layer. Such PDFs would be truly and fully searchable, which I
think would be highly valuable for bibliophiles. This task
requires aligning two texts: mapping each word in the proofread text
to one word in the original ABBYY file (this is not entirely
accurate since two words are sometimes recognized as a single word
by the OCR, and vice versa). I have a few ideas on how to properly
solve this problem: it is actually very similar (and even
simpler!) to the so-called "phrase alignment" problem found in
machine translation and natural language processing, and the
probabilistic models used there could easily be adapted to our
problem. I know that some attempts have been made in the past to
tackle this problem, but I don't have a clear view of what has
been tried exactly, and how successful the attempts were. I would
highly appreciate any information you could have about this.
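
To make the alignment idea concrete, here is a minimal sketch. It is
not the probabilistic phrase-alignment model mentioned above: it
swaps in the plain sequence matching from Python's standard difflib
module, which is enough to show the split/merged word case. The data
layout is an assumption made for illustration: each OCR word is a
(text, (left, top, right, bottom)) pair, and the names union_box and
align are hypothetical, not part of any existing tool.

```python
from difflib import SequenceMatcher


def union_box(boxes):
    """Return the smallest rectangle covering all the given boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def align(ocr_words, proofread_words):
    """Attach OCR word boxes to proofread words.

    ocr_words: list of (text, (left, top, right, bottom)), an assumed layout.
    proofread_words: list of strings.
    Returns a list of (proofread_text, box) pairs; box is None when the
    proofread text has no OCR counterpart at all.
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical words: reuse each OCR box one to one.
            for k in range(i2 - i1):
                result.append((proofread_words[j1 + k], ocr_words[i1 + k][1]))
        elif tag == "replace":
            # Corrected, split or merged words: give the whole proofread
            # group the union of the OCR boxes it replaces.
            box = union_box([b for _, b in ocr_words[i1:i2]])
            result.append((" ".join(proofread_words[j1:j2]), box))
        elif tag == "insert":
            # Text the OCR missed entirely: no position is known.
            result.append((" ".join(proofread_words[j1:j2]), None))
        # "delete": OCR noise with no proofread counterpart is dropped.
    return result


ocr = [
    ("Tbe", (0, 0, 10, 5)),
    ("quick", (12, 0, 30, 5)),
    ("bro", (32, 0, 40, 5)),   # one word that the OCR split in two
    ("wn", (41, 0, 48, 5)),
    ("fox", (50, 0, 60, 5)),
]
aligned = align(ocr, ["The", "quick", "brown", "fox"])
```

Here "brown" ends up with the box covering both "bro" and "wn". A
probabilistic model would mainly improve the "replace" branch, where
this sketch lumps a whole run of changed words into a single box.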
Thibaut
On 07/17/2013 10:13 PM, David Cuenca wrote:
I agree with Alex, the xml is not about getting
editors to work with it, but to improve the output of the text. If
it can be combined with the Visual Editor to add some
pre-formatting and maybe signaling which words are unclear, that
would be already a big improvement.
If in addition to that, it can be used to compare proofread text
with ocr text for remapping purposes, even better.
Micru
On Wed, Jul 17, 2013 at 3:26 PM, Alex Brollo <alex.brollo@gmail.com> wrote:
Perhaps there's a misunderstanding: I
mentioned abbyy.xml, but with no plan to import it
as is; abbyy.xml is only a surprising data container from
which to extract anything useful to speed up proofreading (and
formatting), nothing more than this.
Just an example: vertical djvu coordinates of lines can
be used to get font size; horizontal coordinates of lines
can be used to recognize centered text; paragraph
splitting is valuable; columns can be recognized, margins
too; with some effort, poems can probably pop up.
Far from simply importing coordinates, it's a matter
of using them as best we can; no data, no information
content.
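
The coordinate examples above can be sketched in a few lines. This
is only an illustration under stated assumptions: each line is
given as a (left, top, right, bottom) box in page units, line height
stands in for font size (a rough proxy, since ascenders and
descenders vary from line to line), and the function name
classify_lines and both thresholds are hypothetical choices, not
values taken from any existing tool.

```python
from statistics import median


def classify_lines(lines, tolerance=0.05, min_indent=0.1):
    """Flag centered lines and estimate relative font size.

    lines: non-empty list of (left, top, right, bottom) boxes (assumed layout).
    tolerance: largest allowed difference between the left and right gaps,
        as a fraction of the text block width.
    min_indent: smallest gap on each side for a line to count as centered,
        so that full-width lines are not flagged.
    """
    block_left = min(left for left, _, _, _ in lines)
    block_right = max(right for _, _, right, _ in lines)
    block_width = block_right - block_left
    typical_height = median(bottom - top for _, top, _, bottom in lines)

    report = []
    for left, top, right, bottom in lines:
        left_gap = (left - block_left) / block_width
        right_gap = (block_right - right) / block_width
        centered = (
            abs(left_gap - right_gap) <= tolerance and left_gap >= min_indent
        )
        report.append({
            "centered": centered,
            # Ratio above 1 suggests a larger font than the body text.
            "size_ratio": (bottom - top) / typical_height,
        })
    return report


page = [
    (100, 50, 900, 80),    # full-width body line
    (350, 100, 650, 160),  # short, tall line with equal gaps on both sides
    (100, 180, 900, 210),  # full-width body line
    (100, 220, 600, 250),  # short last line of a paragraph
]
report = classify_lines(page)
```

On this made-up page, only the second line is flagged as centered,
and its size ratio of 2.0 hints at a heading.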
_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
--
Etiamsi omnes, ego non