IA gives abbyy xml files too (as .gz files); I opened one of them after a suggestion from Phe, and I'm dreaming about extracting something useful to help proofreading. The only "small" problem is that I barely know what XML is, that it is similar to HTML in its (well-formed) structure, and that something called XSLT exists. :-(
Is any of you working on abbyy xml files with a "little bit" more skill?
Alex brollo
Alex Brollo, 14/06/2013 08:45:
IA gives abbyy xml files too (as .gz files); [...] Is any of you working on abbyy xml files with a "little bit" more skill?
Someone produced something here: https://groups.google.com/forum/?fromgroups#!topic/abbyy-ocr-for-linux/Ih7no7KwslA
Also, from 2012, a planned "lura2hocr -- convert Luratech Abbyy XML to hOCR": https://code.google.com/p/hocr-tools/wiki/PageName
Nemo
I got it. o_O
No need for regex, lxml, pyquery nor XSLT... the simplest Python parsing routines can understand abbyy xml and extract both the text and information about the text.
The goal was to get, with Python, both the plain text (the same produced by the wikisource server when creating a new page from a djvu text layer) and some html formatting, in a format usable by VisualEditor; and if you take a look at http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox, you'll see in red only the words whose wordPenalty parameter is greater than 0 in the source abbyy xml file.
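For illustration, a minimal sketch of that kind of parsing with only the standard library; the element and attribute names (line, charParams, wordPenalty) are assumed from ABBYY's published FineReader 6/8 schema, so check them against a real file:

import xml.etree.ElementTree as ET

def tag_name(elem):
    # ABBYY files declare an XML namespace on every element; strip it.
    return elem.tag.rsplit('}', 1)[-1]

def extract_lines(path):
    """Yield (text, max_word_penalty) for each OCR line in the file."""
    for elem in ET.parse(path).iter():
        if tag_name(elem) != 'line':
            continue
        chars, penalty = [], 0
        for char in elem.iter():
            if tag_name(char) == 'charParams':
                chars.append(char.text or '')
                # wordPenalty is documented as an integer; 0 means no doubt.
                penalty = max(penalty, int(char.get('wordPenalty', 0)))
        yield ''.join(chars), penalty

for text, penalty in extract_lines('book_abbyy.xml'):  # placeholder file name
    print(text + (' <-- check' if penalty > 0 else ''))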
Alex brollo (from it.wikisource)
This is a link to dig into abbyy xml: http://www.abbyy-developers.com/en:tech:features:xml
It's very exciting, and far less esoteric than it seems at first look. Perhaps abbyy xml could be used as the main source of usable OCR data in the proofread procedure (an abbyy.gz file is listed for any OCR-ed Internet Archive book, and it is possible to get the OCR with python routines: take a look at http://it.wikisource.org/wiki/Indice:Fisiologia_del_matrimonio.djvu, a test book where pages 17-30 come straight from the abbyy.xml file).
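A minimal sketch of getting that file with plain Python, assuming the usual IA naming convention https://archive.org/download/<identifier>/<identifier>_abbyy.gz (the identifier below is a placeholder, not a real item):

import gzip
import urllib.request

def fetch_abbyy_xml(identifier):
    """Download and decompress an IA _abbyy.gz file; assumes UTF-8 XML."""
    url = f'https://archive.org/download/{identifier}/{identifier}_abbyy.gz'
    with urllib.request.urlopen(url) as response:
        return gzip.decompress(response.read()).decode('utf-8')

xml_text = fetch_abbyy_xml('some-ia-identifier')  # placeholder identifier
print(xml_text[:500])  # peek at the start of the document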
Alex
Just to pin down our present thoughts/"discoveries":
1. The ABBYY OCR procedure outputs an _abbyy.xml file, containing every detail of the multi-level text structure and detailed information, character by character, about formatting and recognition quality; the _abbyy.xml file is published by IA as an _abbyy.gz file.
2. Some of the _abbyy.xml data is wrapped into the IA djvu text layer; the multi-level structure is kept, but the details about characters are discarded.
3. MediaWiki takes the "pure text" from the djvu text layer, discards all the other multi-level data of the djvu layer, and loads the text into new nsPage pages.
4. Finally and painfully, wikisource users add the formatting back into the raw text; to a large extent, they rebuild from scratch data that was present in the original, source abbyy.xml file and - in part - in the djvu text layer (see the sketch below). :-(
This seems deeply unsound IMHO, doesn't it?
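A minimal sketch of the data lost along that chain, assuming the FineReader schema where each text run is a formatting element with optional bold/italic attributes wrapping charParams children (the attribute names and values are an assumption from the ABBYY docs; verify against a real file):

import xml.etree.ElementTree as ET

def runs_to_html(path):
    """Re-emit each OCR text run with <b>/<i> tags; line breaks omitted."""
    out = []
    for elem in ET.parse(path).iter():
        if elem.tag.rsplit('}', 1)[-1] != 'formatting':
            continue
        # Collect the characters of this run from its charParams children.
        run = ''.join(c.text or '' for c in elem
                      if c.tag.rsplit('}', 1)[-1] == 'charParams')
        if elem.get('bold') in ('true', '1'):    # assumed attribute spelling
            run = '<b>' + run + '</b>'
        if elem.get('italic') in ('true', '1'):  # assumed attribute spelling
            run = '<i>' + run + '</i>'
        out.append(run)
    return ''.join(out)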
Alex
Are you following this thread? Is it something we can share with one of the GSoCers?
Aubrey
On Mon, Jun 17, 2013 at 8:32 AM, Alex Brollo alex.brollo@gmail.com wrote:
Just to pin down our present thoughts/"discoveries": [...]
On 06/17/2013 08:32 AM, Alex Brollo wrote:
Just to pin down our present thoughts/"discoveries": [...]
This seems deeply unsound IMHO, doesn't it?
Yes. But it's the best current practice. We know no better way that we can afford. I suspect that Google develops its own OCR software and probably uses some manual proofreaders, but hopefully with a much tighter feedback loop to the OCR software developers than we have. Both the Internet Archive and Wikisource volunteers use a cheap, commercial version of ABBYY Finereader, and we have no dialogue with that company. And why should they listen to us? We have no more money to provide, but Google does pay its OCR software developers.
We could set up a team of 10 to 50 OCR developers, if we had the money. It would operate on all the scanned images in the Internet Archive and work closely with proofreaders to improve the overall text quality. Should we? It is easy to calculate the cost of salaries and equipment, but how do we calculate the benefit that this team brings to society?
If we were already paying salaries to proofreaders, then we could save a lot of money by producing better OCR text (with formatting). But we have no such existing expenditure to reduce.
On Mon, Jun 17, 2013 at 10:12 AM, Lars Aronsson lars@aronsson.se wrote:
Both the Internet Archive and Wikisource volunteers use a cheap, commercial version of ABBYY Finereader, and we have no dialogue with that company. [...]
I actually had contact with an ABBYY Finereader sales manager, but after a short conversation on this list I didn't follow up, as the community was not enthusiastic about it, and I was worried about the amount of money they might ask of us.
Aubrey
Andrea Zanni, 17/06/2013 11:00:
I actually had contact with an ABBYY Finereader sales manager [...]
Sure, we could get a better version of Finereader for server-side OCR use, but they would certainly not change their format for our purposes.
Nemo
Just to remark that IA OCR is excellent - but it is heavily limited by poor scan quality, since Google shares bad scans online (I presume Google saves much better scans for internal use :-) ). This is why IMHO the most efficient procedure to get a good OCR for free is simply to upload an excellent pdf, made from TIFF-saved scans, into IA, then wait briefly for the output.
What is to be discouraged is uploading low-quality pdfs from Google directly, transforming them into low-quality djvu files, and running FineReader 10 or 11 on them: there is presently no way to get an abbyy.xml file out of FineReader 10 or 11. Even when working with low-quality pdfs from Google, the best option at present is to upload them into IA; character recognition can be obtained from FineReader 10 or 11, but the best you get from FineReader 11 is a structured, mapped djvu text layer via djvu export, while all the remaining formatting (font size, bold, uncertainty of words) is lost.
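A minimal sketch of that upload workflow with the internetarchive Python library (pip install internetarchive; credentials configured beforehand with "ia configure"); the identifier, file name and metadata below are placeholders:

from internetarchive import upload

upload(
    'my-book-identifier',      # placeholder: must be globally unique on IA
    files=['my-book.pdf'],     # the PDF built from the TIFF scans
    metadata={
        'title': 'My Book',
        'mediatype': 'texts',  # 'texts' items go through the OCR derive
    },
)
# Once IA finishes deriving, the item should offer _djvu.txt and _abbyy.gz
# downloads alongside the original PDF.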
Alex
Alex Brollo, 17/06/2013 11:31:
Just to remark that IA OCR is excellent - but it is heavily limited by poor scan quality, since Google shares bad scans online [...]
This is not true. Finereader 8 on archive.org is wonderful, but it lacks a bunch of improvements compared to 11, most notably – of course – language support. See http://finereader.abbyy.com/corporate/new_features/ and http://finereader.abbyy.com/professional/tech_specs/#lang for details. IA's OCR for some languages or typefaces will be totally useless, while with recent versions it would be OK.
Nemo