2014-10-01 15:15 GMT+02:00 Jane Darnell <jane023@gmail.com>:
I have seen many messy text-image mixes on Google Books, especially older texts from manual typesetting days. That's why I was wondering if it would be possible to have a tool that stores pages as you go, so you can step in and adjust it on a per-page basis. I am not familiar with abbyy.xml files, but this may be the way to go.

I burned out a few million neurons trying to parse ABBYY XML files, since I'm not a professional programmer, but what I vaguely saw and understood is very, very exciting. Unfortunately my scripts are too rough to be shared, but I'm certain that real programmers could get unbelievable results from such tons of data. I also found OCR confidence values for every character and every word, so that uncertain words could be highlighted on import... or passed to a reCAPTCHA tool. But using ABBYY XML would be a next step; what I'd like for now is simply a mapped text layer from DjVu files, made simple and useful for any Wikisource user.
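
To make the idea of highlighting uncertain words a bit more concrete, here is a minimal Python sketch. It assumes that each character in the ABBYY XML sits in a charParams element carrying a charConfidence attribute and a wordStart attribute; that is how I remember those files, so please check it against a real one. The function names, the tiny SAMPLE document and the threshold of 50 are all made up for illustration, and scoring a word by its least certain character is just one possible choice.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the assumed ABBYY layout (not real OCR output).
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
<page><block><text><par><line><formatting>
<charParams wordStart="true" charConfidence="90">t</charParams>
<charParams wordStart="false" charConfidence="95">h</charParams>
<charParams wordStart="false" charConfidence="92">e</charParams>
<charParams wordStart="false" charConfidence="255"> </charParams>
<charParams wordStart="true" charConfidence="30">c</charParams>
<charParams wordStart="false" charConfidence="88">a</charParams>
<charParams wordStart="false" charConfidence="91">t</charParams>
</formatting></line></par></text></block></page></document>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works with any schema version."""
    return tag.rsplit("}", 1)[-1]


def words_with_confidence(xml_text):
    """Return (word, lowest character confidence) pairs in reading order."""
    root = ET.fromstring(xml_text)
    words = []
    chars, confs = [], []

    def flush():
        # Close the word being built, if any, and reset the buffers.
        if chars:
            words.append(("".join(chars), min(confs) if confs else None))
        chars.clear()
        confs.clear()

    for el in root.iter():
        name = local_name(el.tag)
        if name == "line":
            flush()  # a new line always ends the previous word
        elif name == "charParams":
            text = el.text or ""
            if not text.strip():
                flush()  # whitespace separates words
                continue
            if el.get("wordStart") == "true":
                flush()
            chars.append(text)
            conf = el.get("charConfidence")
            if conf is not None:
                confs.append(int(conf))
    flush()
    return words


def uncertain_words(xml_text, threshold=50):
    """Words whose least certain character scores below the (assumed) threshold."""
    return [w for w, c in words_with_confidence(xml_text)
            if c is not None and c < threshold]


if __name__ == "__main__":
    print(uncertain_words(SAMPLE))
```

The list returned by uncertain_words() is what an import tool could highlight, or hand over to a reCAPTCHA-like check.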
