I explored the abbyy.gz files, the full XML output of the ABBYY OCR engine running at Internet Archive, and I was astonished by the amount of data they contain: they are stored at XCA_Extended detail (as documented at
http://www.abbyy-developers.com/en:tech:features:xml ).
This is something that Wikisource's best developers should explore; comparing that data with the small amount of data in the mapped text layer of DjVu files is impressive and should be inspiring.
But it is static data produced with standard settings... nothing like a service with simple, shared, deep learning features for difficult and ancient texts. I tried the "ancient Italian" Tesseract dictionary, with very poor results.
So Asaf, I can't wait for good news from you. :-)
Alex