Just to inspire any of you who are comfortable working with XML: I'm going to try to merge two interesting files produced by the Internet Archive derive routine, djvu.xml and abbyy.xml.
As many of you know, both contain mapped text from OCR. The first contains mapped text at word level, with no formatting data and no data about recognition probability. The second is much more complex: it gives a rich set of data at the level of single characters, so detailed that it is almost unusable. Both seem to come from the same process, and the data they share seem identical.
So it seems possible to extract some interesting, selected data from the abbyy.xml file and inject them into the djvu.xml file, then load the result into the djvu text layer; the most interesting data being the recognition probability of words.
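To make the idea concrete, here is a minimal sketch in Python, not a tested tool. It averages per-character confidence into a per-word value and writes it onto the matching word of the djvu.xml tree. Everything in it is an assumption to verify against real files: the two inline samples are tiny, simplified stand-ins; the names charParams, charConfidence and WORD are what I believe the files use; the "confidence" attribute and the functions abbyy_words and inject are hypothetical names invented for this example. Words are split on whitespace characters and paired purely by order, which only works while both files list the same words in the same sequence. Whether the djvu text layer can actually carry such a value is a separate question.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the two files (assumed structure, not real data).
ABBYY_SAMPLE = """<document><page><block><text><par><line>
<charParams l="10" t="5" r="18" b="20" charConfidence="90">H</charParams>
<charParams l="19" t="5" r="25" b="20" charConfidence="70">i</charParams>
<charParams l="26" t="5" r="30" b="20" charConfidence="100"> </charParams>
<charParams l="31" t="5" r="40" b="20" charConfidence="40">y</charParams>
<charParams l="41" t="5" r="50" b="20" charConfidence="60">o</charParams>
</line></par></text></block></page></document>"""

DJVU_SAMPLE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH><LINE>
<WORD coords="10,20,25,5">Hi</WORD>
<WORD coords="31,20,50,5">yo</WORD>
</LINE></PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def local(tag):
    """Drop an XML namespace prefix such as '{uri}charParams'."""
    return tag.rsplit("}", 1)[-1]


def abbyy_words(root):
    """Return [(word, mean character confidence), ...] in document order."""
    words, chars, confs = [], [], []

    def flush():
        # Close the word being collected, if any.
        if chars:
            words.append(("".join(chars), sum(confs) / len(confs)))
            chars.clear()
            confs.clear()

    for el in root.iter():
        tag = local(el.tag)
        if tag == "line":
            flush()  # a new line always ends the previous word
        if tag != "charParams":
            continue
        ch = el.text or ""
        if not ch.strip():
            flush()  # a whitespace character separates words
            continue
        chars.append(ch)
        # Assumption: a missing confidence value is counted as 0.
        confs.append(float(el.get("charConfidence", "0")))
    flush()
    return words


def inject(djvu_root, words):
    """Set a hypothetical 'confidence' attribute on each WORD whose text
    matches the abbyy word at the same position; return the match count."""
    matched = 0
    for word_el, (text, conf) in zip(djvu_root.iter("WORD"), words):
        if (word_el.text or "").strip() == text:
            word_el.set("confidence", f"{conf:.0f}")
            matched += 1
    return matched


if __name__ == "__main__":
    djvu = ET.fromstring(DJVU_SAMPLE)
    inject(djvu, abbyy_words(ET.fromstring(ABBYY_SAMPLE)))
    print(ET.tostring(djvu, encoding="unicode"))
```

Pairing by position keeps the sketch short, but a real merge would probably have to realign the two sequences, for example by comparing the word coordinates, as soon as the files disagree on a single word.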
Has such an idea already been explored or developed? I often "reinvent the wheel" :-)
Alex Brollo
wikisource-l@lists.wikimedia.org