Just to let you know what I'm doing: I'm exploring abbyy.xml (the _abbyy.gz file in the Internet Archive file list).

The abbyy.xml file contains a lot of data that could take the "self-formatting" of text much further, with details that can't be found in the text layer of djvu files. It holds the XCA_Extended version of the OCR's XML output (http://www.abbyy-developers.com/en:tech:features:xml), and this is a brief list of its useful features:

1. coordinates l, t, r, b of every element (from page down to character);

2. three main "blockType" values: text, table, picture;

3. four levels of detail for text areas: region, paragraph, line, character (and a fifth one, word, can be calculated);

4. data about indentation, font size, and recognition certainty for words and characters.
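
As a rough illustration of the list above, here is a minimal Python sketch of how the "fifth level" (words) might be calculated from character data. It runs on a hand-made fragment, not on real OCR output. The element and attribute names (block, par, line, charParams, charConfidence) are assumptions about the abbyy.xml layout and should be checked against an actual file; the threshold of 50 for a "doubtful" word is arbitrary.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment imitating the assumed layout of abbyy.xml (not real output).
SAMPLE = """<document>
 <page width="1000" height="1500">
  <block blockType="Text" l="100" t="200" r="400" b="240">
   <text><par><line l="100" t="200" r="400" b="240">
    <formatting>
     <charParams l="100" t="200" r="120" b="240" charConfidence="90">H</charParams>
     <charParams l="122" t="205" r="140" b="240" charConfidence="35">i</charParams>
     <charParams l="140" t="200" r="160" b="240"> </charParams>
     <charParams l="160" t="210" r="180" b="238" charConfidence="80">y</charParams>
     <charParams l="182" t="210" r="200" b="238" charConfidence="85">o</charParams>
    </formatting>
   </line></par></text>
  </block>
 </page>
</document>"""


def local(tag):
    """Drop any '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def words_in_line(line):
    """Group the characters of one line into words, splitting on whitespace."""
    groups, current = [], []
    for el in line.iter():
        if local(el.tag) != "charParams":
            continue
        if (el.text or " ").isspace():
            if current:
                groups.append(current)
                current = []
        else:
            current.append(el)
    if current:
        groups.append(current)

    words = []
    for chars in groups:
        conf = [int(c.get("charConfidence")) for c in chars if c.get("charConfidence")]
        words.append({
            "text": "".join(c.text for c in chars),
            # The word box is the union of its character boxes.
            "l": min(int(c.get("l")) for c in chars),
            "t": min(int(c.get("t")) for c in chars),
            "r": max(int(c.get("r")) for c in chars),
            "b": max(int(c.get("b")) for c in chars),
            # The least certain character decides how doubtful the word is.
            "min_conf": min(conf) if conf else None,
        })
    return words


def extract_words(xml_text):
    """Return all words found in every line of the document."""
    root = ET.fromstring(xml_text)
    out = []
    for el in root.iter():
        if local(el.tag) == "line":
            out.extend(words_in_line(el))
    return out


words = extract_words(SAMPLE)
# Arbitrary threshold: words containing a character below 50 count as doubtful.
doubtful = [w for w in words if w["min_conf"] is not None and w["min_conf"] < 50]
for w in words:
    print(w)
```

Splitting on whitespace characters is only one possible way to rebuild words; the real file may carry word-boundary hints that would do the job better.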

Using the coordinates and the original page images, it's possible to crop any region out of a page image. This could be useful both for a "wikiReCaptcha" engine (extracting doubtful words together with their images) and for extracting pictures, or showing them without extracting. The latter can be done by using a clone of the existing page thumbnail as the background of a div and setting the div and overflow coordinates appropriately, with a very low server load.
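
The "show without extracting" idea could be sketched roughly as follows, in Python. The function name div_style, the example URL, and the numbers are all hypothetical; the sketch assumes the thumbnail is a uniformly scaled copy of the page and uses the CSS background-position property as one possible way to set the offsets.

```python
def div_style(box, page_width, thumb_width, thumb_url):
    """Build an inline CSS style showing one region of a page thumbnail.

    box is (l, t, r, b) in OCR page coordinates; page_width is the page
    width in the same units; thumb_width is the thumbnail width in pixels.
    """
    scale = thumb_width / page_width          # OCR units -> thumbnail pixels
    l, t, r, b = (round(v * scale) for v in box)
    return (
        f"width:{r - l}px;height:{b - t}px;"
        f"background-image:url('{thumb_url}');"
        # Negative offsets shift the thumbnail so the region lands at 0,0.
        f"background-position:{-l}px {-t}px;"
        "background-repeat:no-repeat;overflow:hidden"
    )


# Hypothetical picture block on a 2000-unit-wide page, shown on a 500px thumbnail.
style = div_style((200, 400, 600, 1000), 2000, 500, "https://example.org/page7_thumb.jpg")
print(f'<div style="{style}"></div>')
```

Since the browser reuses a thumbnail it has usually already fetched, the server does no cropping at all; the price is that the picture is only as sharp as the thumbnail.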

In brief: all this stuff is extremely exciting, and I'm going ahead with my bold attempts, but IMHO the matter deserves the interest of the best source geeks. I'm only playing, with very limited skill and a rough layman's programming style.

Alex brollo (from it.wikisource)