It's very exciting, and far less esoteric than it seems at first glance.
Perhaps abbyy xml could be used as the main source of usable OCR data in
the proofreading procedure (an abbyy.gz file is listed for every OCR-ed
Internet Archive book, and the OCR can be extracted with python routines:
take a look at
, a
test book where pages 17-30 come straight from the abbyy.xml file).
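
A minimal sketch of reading such an abbyy.gz file with the Python standard library. The function name and the file name in the usage note are hypothetical, and UTF-8 is an assumption about how the xml inside the archive is encoded.

```python
import gzip


def read_abbyy(path):
    """Return the decompressed xml stored in an abbyy.gz file as one string.

    Assumption: the xml is UTF-8 encoded. The whole file is read into
    memory, which may be heavy for a long book.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```

Usage would look like `xml = read_abbyy("mybook_abbyy.gz")`, where the file has already been downloaded from the book's Internet Archive page.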
Alex
2013/6/15 Alex Brollo <alex.brollo(a)gmail.com>
I got it. o_O
No need for regex, lxml, pyquery or XSLT: the simplest python parsing
routines can read abbyy xml and extract both the text and information
about the text.
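
To illustrate the idea, here is a minimal sketch that pulls plain text out of abbyy xml using nothing but str.find and str.split. It is not the routine used for the test pages; the function names and the hand-made SAMPLE are hypothetical, and it assumes every line and charParams element has an explicit closing tag and no ">" inside attribute values.

```python
from html import unescape


def parse_attrs(head):
    """Turn ' l="10" t="80"' into {'l': '10', 't': '80'}."""
    parts = head.split('"')  # attribute names and values alternate
    return {name.strip(" =/\n\r\t"): value
            for name, value in zip(parts[0::2], parts[1::2])}


def iter_elements(xml, tag):
    """Yield (attributes, inner text) for each <tag ...>...</tag> in xml."""
    open_tag, close_tag = "<" + tag, "</" + tag + ">"
    pos = 0
    while True:
        start = xml.find(open_tag, pos)
        if start == -1:
            return
        name_end = start + len(open_tag)
        if xml[name_end:name_end + 1] not in (" ", "\n", "\r", "\t", ">"):
            pos = name_end  # a longer tag name that only starts the same way
            continue
        head_end = xml.find(">", name_end)
        end = xml.find(close_tag, head_end)
        if head_end == -1 or end == -1:
            return
        yield parse_attrs(xml[name_end:head_end]), xml[head_end + 1:end]
        pos = end + len(close_tag)


def plain_text(xml):
    """Join the characters of each <line> and separate lines with newlines."""
    lines = []
    for _attrs, body in iter_elements(xml, "line"):
        chars = [unescape(text) for _a, text in iter_elements(body, "charParams")]
        lines.append("".join(chars))
    return "\n".join(lines)


# Hand-made fragment in the shape of abbyy xml (hypothetical values).
SAMPLE = """
<line baseline="20" l="1" t="2" r="30" b="22"><formatting lang="ItalianStandard">
<charParams l="1" t="2" r="9" b="20" wordStart="true" wordPenalty="0">L</charParams>
<charParams l="10" t="2" r="18" b="20" wordStart="false" wordPenalty="0">a</charParams>
</formatting></line>
<line baseline="50" l="1" t="32" r="30" b="52"><formatting lang="ItalianStandard">
<charParams l="1" t="32" r="9" b="50" wordStart="true" wordPenalty="0">Q</charParams>
<charParams l="10" t="32" r="18" b="50" wordStart="false" wordPenalty="0">&amp;</charParams>
<charParams l="19" t="32" r="27" b="50" wordStart="false" wordPenalty="0">A</charParams>
</formatting></line>
"""

print(plain_text(SAMPLE))
```

The attribute dictionary returned with each element is where the "information about the text" lives: coordinates, word boundaries, penalties and so on.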
The goal was to get, with python, both plain text (the same text the
wikisource server produces when creating a new page from a djvu text layer)
and some html formatting, in a form usable by VisualEditor. If you take a
look at
http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox, you'll
see in red only the words whose wordPenalty parameter is greater than 0 in
the source abbyy xml file.
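
A rough sketch of how such red marking could be produced with plain string routines. It is an illustration, not the code behind the sandbox page: the function names and SAMPLE are hypothetical, and it assumes that wordStart="true" marks the first character of a word, that wordPenalty holds an integer, and that a word counts as suspicious if any of its characters has a penalty above 0.

```python
from html import escape, unescape


def attr(head, name, default=""):
    """Return the value of one attribute from an opening-tag string."""
    marker = " " + name + '="'
    start = head.find(marker)
    if start == -1:
        return default
    start += len(marker)
    return head[start:head.find('"', start)]


def line_to_html(line_xml):
    """Rebuild one line of text, wrapping penalised words in a red span."""
    words = []  # each entry is [text, penalised]
    for chunk in line_xml.split("<charParams")[1:]:
        head, _, rest = chunk.partition(">")
        char = unescape(rest.partition("</charParams>")[0])
        if char.isspace():
            words.append([char, False])  # keep spacing as its own token
            continue
        new_word = (attr(head, "wordStart") == "true"
                    or not words or words[-1][0].isspace())
        if new_word:
            words.append(["", False])
        words[-1][0] += char
        if int(attr(head, "wordPenalty", "0")) > 0:
            words[-1][1] = True
    return "".join(
        '<span style="color:red">%s</span>' % escape(text) if bad else escape(text)
        for text, bad in words)


def to_html(xml):
    """Convert every <line> that contains characters, one html line each."""
    parts = [p for p in xml.split("</line>") if "<charParams" in p]
    return "<br>\n".join(line_to_html(p) for p in parts)


# Hand-made fragment in the shape of abbyy xml (hypothetical values).
SAMPLE = """
<line baseline="20" l="1" t="2" r="90" b="22"><formatting lang="ItalianStandard">
<charParams l="1" wordStart="true" wordPenalty="0">L</charParams>
<charParams l="2" wordStart="false" wordPenalty="0">a</charParams>
<charParams l="3" wordStart="false" wordPenalty="0"> </charParams>
<charParams l="4" wordStart="true" wordPenalty="36">c</charParams>
<charParams l="5" wordStart="false" wordPenalty="36">4</charParams>
<charParams l="6" wordStart="false" wordPenalty="36">s</charParams>
<charParams l="7" wordStart="false" wordPenalty="36">a</charParams>
</formatting></line>
"""

print(to_html(SAMPLE))
```

Raising the threshold above 0 would mark fewer words; which value is most useful for proofreading is an open question here.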
Alex brollo (from it.wikisource)
2013/6/14 Alex Brollo <alex.brollo(a)gmail.com>
IA provides abbyy xml files too (as .gz files); I opened one of them at
Phe's suggestion, and I'm dreaming of extracting anything useful to
help proofreading. The only "small" problem is that I barely know what
xml is: only that it is similar to html in its (well-formed) structure, and
that something called XSLT exists. :-(
Are any of you working on abbyy xml files with a "little bit" more
skill?
Alex brollo