On 06/17/2013 08:32 AM, Alex Brollo wrote:
Just to record our present thoughts/"discoveries":
1. the ABBYY OCR procedure outputs an _abbyy.xml file, containing every
detail of the multi-level text structure plus detailed information,
character by character, about formatting and recognition quality; the
_abbyy.xml file is published by IA as an _abbyy.gz file;
2. some of the _abbyy.xml data is wrapped into the IA djvu text layer;
the multi-level structure is kept, but the details about individual
characters are discarded;
3. MediaWiki gets the "pure text" from the djvu text layer, discards
all the other multi-level data of that layer, and loads the text into
new nsPage pages;
4. finally & painfully, wikisource users then add the formatting back
into the raw text; to a large extent, they rebuild from scratch some of
the data that was present in the original source abbyy.xml file and - in
part - in the djvu text layer. :-(
This seems deeply unsound IMHO; doesn't it?
Yes. But it's the best current practice. We know of
no better way that we can afford. I suspect that Google
develops its own OCR software and probably uses
some manual proofreaders, but hopefully with a much
tighter feedback loop to the OCR software
developers than we have. Both the Internet Archive
and Wikisource volunteers use a cheap, commercial
version of ABBYY FineReader and we have no
dialogue with that company. And why should they
listen to us? We have no more money to provide,
but Google does pay its OCR software developers.
We could set up a team of 10 to 50 OCR
developers, if we had the money. It would operate
on all the scanned images in the Internet Archive,
and work closely with proofreaders to improve
the overall text quality. Should we? It is easy to
calculate the cost for salaries and equipment, but
how do we calculate the benefit that this team
brings to society?
If we were already paying salaries to proofreaders,
then we could save a lot of money by producing
better OCR text (with formatting). But we have no
such existing expenditure to reduce.
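To make the information loss in steps 1 to 3 of the quoted list
concrete, here is a minimal Python sketch. It is only an illustration
under stated assumptions: the XML fragment is a hypothetical, heavily
simplified stand-in for an _abbyy.xml file (the real file has
namespaces and many more levels), and the element and attribute names
(line, formatting, charParams, italic, charConfidence) as well as the
three function names are chosen for this example, not taken from IA or
MediaWiki code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an _abbyy.xml fragment.
# Element and attribute names are assumptions made for this sketch.
SAMPLE = """
<page>
  <line>
    <formatting italic="1">
      <charParams charConfidence="98">O</charParams>
      <charParams charConfidence="41">C</charParams>
      <charParams charConfidence="95">R</charParams>
    </formatting>
  </line>
  <line>
    <formatting>
      <charParams charConfidence="90">o</charParams>
      <charParams charConfidence="88">k</charParams>
    </formatting>
  </line>
</page>
"""


def rich_chars(xml_text):
    """Step 1: every character with its formatting and confidence."""
    root = ET.fromstring(xml_text)
    chars = []
    for line_no, line in enumerate(root.iter("line")):
        for fmt in line.iter("formatting"):
            italic = fmt.get("italic") == "1"
            for ch in fmt.iter("charParams"):
                chars.append({
                    "line": line_no,
                    "char": ch.text,
                    "italic": italic,
                    "confidence": int(ch.get("charConfidence")),
                })
    return chars


def structured_lines(chars):
    """Step 2: keep the line structure, drop per-character details."""
    lines = {}
    for c in chars:
        lines.setdefault(c["line"], []).append(c["char"])
    return ["".join(text) for _, text in sorted(lines.items())]


def pure_text(lines):
    """Step 3: keep only the plain text, drop the structure as well."""
    return "\n".join(lines)
```

After step 3 the italic flag and the low confidence of the second
character are gone, and those are exactly the hints a proofreader in
step 4 would want to have.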
--
Lars Aronsson (lars(a)aronsson.se)
Project Runeberg - free Nordic literature -
http://runeberg.org/