On 06/17/2013 08:32 AM, Alex Brollo wrote:
Just to pin down our present thoughts/"discoveries":
1. The ABBYY OCR procedure outputs an _abbyy.xml file, containing every detail of the multi-level text structure plus detailed information, character by character, about formatting and recognition quality; IA publishes the _abbyy.xml file as an _abbyy.gz file (see the sketch after this list).
2. Some of the _abbyy.xml data is wrapped into the IA djvu text layer; the multi-level structure is kept, but the per-character details are discarded.
3. MediaWiki takes only the "pure text" from the djvu text layer, discards all the other multi-level data of the djvu layer, and loads the text into new nsPage pages.
4. Finally & painfully, wikisource users add the formatting back into the raw text; to a large extent they rebuild from scratch data that was present in the original abbyy.xml source file and, in part, in the djvu text layer. :-(
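For concreteness, here is a minimal sketch of the character-level data that gets thrown away along the way. It assumes the ABBYY FineReader XML layout (page > block > text > par > line > formatting > charParams) and matches tags by local name, since the schema namespace differs between FineReader versions; attribute names such as charConfidence, ff and bold are taken from that schema and should be checked against an actual IA file:

```python
# Sketch: inspect what an IA _abbyy.gz carries that never reaches the wiki text.
# Assumes the ABBYY FineReader XML schema; namespace is stripped so any version works.
import gzip
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace, e.g. '{...}charParams' -> 'charParams'."""
    return tag.rsplit('}', 1)[-1]

def dump_formatting(path):
    # IA publishes the file gzip-compressed as *_abbyy.gz
    with gzip.open(path, 'rb') as fh:
        tree = ET.parse(fh)
    for fmt in tree.iter():
        if local(fmt.tag) != 'formatting':
            continue
        # Font family, size, bold/italic flags: all discarded before nsPage.
        style = {k: v for k, v in fmt.attrib.items()
                 if k in ('ff', 'fs', 'bold', 'italic', 'superscript')}
        chars = []
        suspect = 0
        for cp in fmt:
            if local(cp.tag) != 'charParams':
                continue
            chars.append(cp.text or '')
            # charConfidence is per-character recognition quality (0-100, -1 = unknown)
            if int(cp.get('charConfidence', '100')) < 50:
                suspect += 1
        text = ''.join(chars)
        if text.strip():
            print(style, repr(text), 'suspect chars:', suspect)

# dump_formatting('example_abbyy.gz')
```

For comparison, the intermediate djvu text layer (word positions, but no formatting or confidence data) can be dumped with something like djvused -e print-txt book.djvu.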
This seems deeply unsound, IMHO; doesn't it?
Yes. But it is the best current practice; we know of no better way that we can afford. I suspect that Google develops its own OCR software and probably uses some manual proofreaders, but hopefully with a much tighter feedback loop to the OCR software developers than we have. Both the Internet Archive and Wikisource volunteers use a cheap, commercial version of ABBYY FineReader, and we have no dialogue with that company. And why should they listen to us? We have no money to offer them, but Google does pay its OCR software developers.
We could set up a team of 10 to 50 OCR developers, if we had the money. It would work on all the scanned images in the Internet Archive and cooperate closely with proofreaders to improve the overall text quality. Should we? It is easy to calculate the cost of salaries and equipment, but how do we calculate the benefit that such a team would bring to society?
If we were already paying salaries to proofreaders, then we could save a lot of money by producing better OCR text (with formatting). But we have no such existing expenditure to reduce.