Just to inspire any of you who are comfortable working with XML: I'm going
to try to merge two interesting files produced by the Internet Archive
derive routine, djvu.xml and abbyy.xml.
As many of you know, both contain mapped text from OCR. The first one
contains text mapped at the word level, with no formatting data and no
data about recognition probability; the second one is much more complex: it
gives a detailed set of data at the level of the single character, so
complex and detailed that it is almost unusable. Both seem to come from the
same process and seem identical in the data they share.
So it seems possible to extract some interesting, selected data from the
abbyy.xml file and inject it into the djvu.xml file, then load the result
into the djvu text layer; the most interesting data being the recognition
probability of each word.
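
To make the idea concrete, here is a rough Python sketch, not a tested tool.
It assumes that abbyy.xml stores each character in a charParams element
with a charConfidence attribute, grouped under line elements, and that
djvu.xml stores each word in a WORD element; please check these names
against your own files. The "confidence" attribute written onto WORD is
purely hypothetical, and I don't know whether the djvu text layer would
keep it. Words are paired simply by position and text; real files would
probably need matching by coordinates.

```python
import xml.etree.ElementTree as ET

# Tiny made-up samples; element and attribute names are assumptions
# about the two formats, and the coords values are only illustrative.
ABBYY_SAMPLE = """<document><page><block><text><par><line><formatting>
  <charParams l="10" t="5" r="20" b="30" charConfidence="90">H</charParams>
  <charParams l="21" t="5" r="30" b="30" charConfidence="70">i</charParams>
  <charParams l="31" t="5" r="40" b="30"> </charParams>
  <charParams l="41" t="5" r="50" b="30" charConfidence="40">y</charParams>
  <charParams l="51" t="5" r="60" b="30" charConfidence="60">o</charParams>
</formatting></line></par></text></block></page></document>"""

DJVU_SAMPLE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION>
  <PARAGRAPH><LINE>
    <WORD coords="10,30,30,5">Hi</WORD> <WORD coords="41,30,60,5">yo</WORD>
  </LINE></PARAGRAPH>
</REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}line' down to 'line'."""
    return tag.rsplit("}", 1)[-1]


def abbyy_words(root):
    """Return [(word_text, mean_confidence_or_None), ...] in reading order.

    Characters are grouped into words by splitting on whitespace
    characters inside each line element.
    """
    words = []

    def flush(text, confs):
        if text:
            mean = round(sum(confs) / len(confs)) if confs else None
            words.append((text, mean))

    for line in root.iter():
        if local_name(line.tag) != "line":
            continue
        text, confs = "", []
        for el in line.iter():
            if local_name(el.tag) != "charParams":
                continue
            ch = el.text or " "
            if ch.isspace():
                flush(text, confs)
                text, confs = "", []
            else:
                text += ch
                conf = el.get("charConfidence")
                if conf is not None:
                    confs.append(int(conf))
        flush(text, confs)
    return words


def inject_confidence(djvu_root, words):
    """Copy confidences onto WORD elements; return how many were matched.

    Pairs words by position and only writes when the text agrees.
    The 'confidence' attribute name is hypothetical.
    """
    matched = 0
    for word_el, (text, conf) in zip(djvu_root.iter("WORD"), words):
        if conf is not None and (word_el.text or "").strip() == text:
            word_el.set("confidence", str(conf))
            matched += 1
    return matched


if __name__ == "__main__":
    djvu = ET.fromstring(DJVU_SAMPLE)
    inject_confidence(djvu, abbyy_words(ET.fromstring(ABBYY_SAMPLE)))
    print(ET.tostring(djvu, encoding="unicode"))
```

Averaging the character confidences is only one possible choice; taking
the minimum per word might be better for spotting doubtful words.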
Has such an idea already been explored/developed? I often "reinvent the
wheel" :-)
Alex brollo