Exploring abbyy.xml: a layman trip - Wikisource-l

1 Jul 2013


Just to let you know what I'm doing: I'm exploring abbyy.xml (the _abbyy.gz
file in the Internet Archive file list).
The abbyy.xml file contains a lot of data that could take "self-formatting"
of text much further, with details that can't be found in the text layer
of djvu files. It contains the XCA_Extended version of the OCR XML output
(http://www.abbyy-developers.com/en:tech:features:xml); here is a brief
list of its useful features:
1. coordinates l,t,r,b of every element (from page down to character);
2. three main "blockType" values: text, table, picture;
3. four levels of detail for text areas: region, paragraph, line, character
(a fifth one, word, can be calculated);
4. data about indentation, font size, and word and character recognition
certainty.
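
To make point 3 concrete, here is a rough sketch of how the "word" level could be calculated from the character level. It is only an illustration under assumptions: the element and attribute names (line, charParams, l, t, r, b) and the namespace in the sample are what I expect in the FineReader XML, and the helper names (local, words_from_line, iter_words) are made up for this example. The tiny SAMPLE document stands in for a real, unzipped _abbyy.gz file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; real files are much larger and gzipped.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
<page width="1000" height="1500"><block blockType="Text" l="10" t="10" r="200" b="50">
<text><par><line l="10" t="10" r="200" b="50"><formatting>
<charParams l="10" t="12" r="20" b="40">H</charParams>
<charParams l="22" t="10" r="30" b="40">i</charParams>
<charParams l="30" t="10" r="40" b="40"> </charParams>
<charParams l="40" t="15" r="55" b="42">y</charParams>
<charParams l="56" t="15" r="70" b="40">o</charParams>
</formatting></line></par></text></block></page></document>"""


def local(tag):
    """Strip an XML namespace prefix: '{uri}line' -> 'line'."""
    return tag.rsplit("}", 1)[-1]


def words_from_line(line_elem):
    """Group the characters of one line into words, splitting on whitespace.

    Returns a list of (text, (l, t, r, b)) tuples; each box is the union
    of the boxes of the characters that make up the word.
    """
    words, chars, box = [], [], None
    for elem in line_elem.iter():
        if local(elem.tag) != "charParams":
            continue
        ch = elem.text or " "  # an empty element is treated as a space
        if ch.isspace():
            if chars:
                words.append(("".join(chars), tuple(box)))
            chars, box = [], None
            continue
        l, t, r, b = (int(elem.get(k)) for k in "ltrb")
        if box is None:
            box = [l, t, r, b]
        else:
            box = [min(box[0], l), min(box[1], t), max(box[2], r), max(box[3], b)]
        chars.append(ch)
    if chars:  # flush the last word of the line
        words.append(("".join(chars), tuple(box)))
    return words


def iter_words(xml_text):
    """Yield (text, box) for every word of every line in the document."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local(elem.tag) == "line":
            yield from words_from_line(elem)


for text, box in iter_words(SAMPLE):
    print(text, box)
```

The tag matching ignores the namespace on purpose, so the sketch does not depend on the exact schema version of the file.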
Using the coordinates, it's possible to extract regions from the original
page image; this could be useful both for a "wikiReCaptcha" engine
(extracting the text of doubtful words together with their images) and for
extracting pictures, or showing them without extracting. The latter can be
done by showing a clone of the existing page thumbnail as the background of
a div and setting the div size and overflow coordinates appropriately, with
a very low server load.
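
A rough sketch of the div trick, just to show the arithmetic: the box coordinates are scaled from OCR page units to thumbnail pixels, then used as a negative background offset. The function name crop_style, its parameters and the "page.jpg" URL are all hypothetical; the only assumption about the data is that the page width and height are known in the same units as the l,t,r,b coordinates.

```python
def crop_style(box, page_w, page_h, thumb_url, thumb_w):
    """Return an inline CSS style showing only `box` of a page thumbnail.

    box      -- (l, t, r, b) in OCR page coordinates
    page_w/h -- page size in the same coordinates
    thumb_w  -- width in pixels at which the thumbnail is displayed
    """
    scale = thumb_w / page_w
    l, t, r, b = (round(v * scale) for v in box)
    thumb_h = round(page_h * scale)
    return (
        f"width:{r - l}px;height:{b - t}px;"
        f"background-image:url('{thumb_url}');"
        f"background-size:{thumb_w}px {thumb_h}px;"
        # shift the background so the box's top-left corner sits at 0,0
        f"background-position:{-l}px {-t}px;"
        "background-repeat:no-repeat;overflow:hidden"
    )


style = crop_style((100, 200, 500, 600), 1000, 1500, "page.jpg", 500)
print(f'<div style="{style}"></div>')
```

Nothing is cropped on the server: the browser reuses a thumbnail it may already have cached and only the style string changes from one picture to the next.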
In brief: all this stuff is extremely exciting. I'm going ahead with my
bold attempts, but IMHO the matter deserves the interest of the best source
geeks; I'm only playing, with very limited skill and a rough layman's
programming style.
Alex Brollo (from it.wikisource)