On 06/17/2013 08:32 AM, Alex Brollo wrote:
Just to pin down our present thoughts/"discoveries":
1. The ABBYY OCR procedure outputs an _abbyy.xml file, containing every detail of the multi-level text structure plus detailed information, character by character, about formatting and recognition quality; IA publishes the _abbyy.xml file as an _abbyy.gz file (see the sketch after this list).
2. Some of the _abbyy.xml data is wrapped into the IA djvu text layer; the multi-level structure is kept, but the per-character details are discarded.
3. MediaWiki takes only the "pure text" from the djvu text layer, discards all the other multi-level data of the djvu layer, and loads the text into new nsPage pages.
4. Finally & painfully, wikisource users add the formatting back into the raw text; to a large extent they rebuild from scratch data that was present in the original abbyy.xml source file and, in part, in the djvu text layer. :-(
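For concreteness, here is a minimal sketch of the character-level data that gets thrown away along the way. It assumes the ABBYY FineReader XML layout (page > block > text > par > line > formatting > charParams) and matches tags by local name, since the schema namespace differs between FineReader versions; attribute names such as charConfidence, ff and bold are taken from that schema and should be checked against an actual IA file:

```python
# Sketch: inspect what an IA _abbyy.gz carries that never reaches the wiki text.
# Assumes the ABBYY FineReader XML schema; namespace is stripped so any version works.
import gzip
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace, e.g. '{...}charParams' -> 'charParams'."""
    return tag.rsplit('}', 1)[-1]

def dump_formatting(path):
    # IA publishes the file gzip-compressed as *_abbyy.gz
    with gzip.open(path, 'rb') as fh:
        tree = ET.parse(fh)
    for fmt in tree.iter():
        if local(fmt.tag) != 'formatting':
            continue
        # Font family, size, bold/italic flags: all discarded before nsPage.
        style = {k: v for k, v in fmt.attrib.items()
                 if k in ('ff', 'fs', 'bold', 'italic', 'superscript')}
        chars = []
        suspect = 0
        for cp in fmt:
            if local(cp.tag) != 'charParams':
                continue
            chars.append(cp.text or '')
            # charConfidence is per-character recognition quality (0-100, -1 = unknown)
            if int(cp.get('charConfidence', '100')) < 50:
                suspect += 1
        text = ''.join(chars)
        if text.strip():
            print(style, repr(text), 'suspect chars:', suspect)

# dump_formatting('example_abbyy.gz')
```

For comparison, the intermediate djvu text layer (word positions, but no formatting or confidence data) can be dumped with something like djvused -e print-txt book.djvu.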
This seems deeply unsound, IMHO; doesn't it?
Yes. But it is the best current practice; we know of no better way that we can afford. I suspect that Google develops its own OCR software and probably uses some manual proofreaders, but hopefully with a much tighter feedback loop to the OCR software developers than we have. Both the Internet Archive and Wikisource volunteers use a cheap, commercial version of ABBYY FineReader, and we have no dialogue with that company. And why should they listen to us? We have no money to offer them, but Google does pay its OCR software developers.
We could set up a team of 10 to 50 OCR developers, if we had the money. It would work on all the scanned images in the Internet Archive and cooperate closely with proofreaders to improve the overall text quality. Should we? It is easy to calculate the cost of salaries and equipment, but how do we calculate the benefit that such a team would bring to society?
If we were already paying salaries to proofreaders, then we could save a lot of money by producing better OCR text (with formatting). But we have no such existing expenditure to reduce.