Re: [Wikisource-l] ABBYY xml files: any of you is working about?

17 Jun 2013

Are you following this thread?
Is it something we can share with one of the GSoCers?

Aubrey

On Mon, Jun 17, 2013 at 8:32 AM, Alex Brollo <alex.brollo(a)gmail.com> wrote:

...
  Just to fix our present thoughts/"discoveries".

 1. The ABBYY OCR procedure outputs an _abbyy.xml file, containing every
 detail of the multi-level text structure plus detailed information,
 character by character, about formatting and recognition quality; IA
 publishes the _abbyy.xml file as an _abbyy.gz file;
 2. some of the _abbyy.xml data is wrapped into the IA djvu text layer; the
 multi-level structure is kept, but the per-character details are discarded;
 3. MediaWiki gets the "pure text" from the djvu text layer, discards all
 the other multi-level data of the djvu layer, and loads the text into new
 nsPage pages;
 4. finally and painfully, wikisource users then add formatting again to
 the raw text; to a large extent, they rebuild from scratch some of the data
 that was present in the original, source abbyy.xml file and, in part, in
 the djvu text layer. :-(

 This seems deeply unsound IMHO, doesn't it?

 Alex

 2013/6/17 Alex Brollo <alex.brollo(a)gmail.com>

  This is a link for digging into abbyy xml:
 http://www.abbyy-developers.com/en:tech:features:xml

 It's very exciting, and far less esoteric than it seems at first look.
 Perhaps abbyy xml could be used as the main source of usable OCR data in
 the proofread procedure (an abbyy.gz file is listed for every OCR-ed
 Internet Archive book, and it is possible to get the OCR with python
 routines: take a look at
 http://it.wikisource.org/wiki/Indice:Fisiologia_del_matrimonio.djvu, a
 test book where pages 17-30 come straight from the abbyy.xml file).
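
 A minimal sketch of what such a python routine could look like, not the
 code actually used for the test book. It assumes the file is a gzipped
 ABBYY XML document whose characters sit in charParams elements nested
 inside line and page elements, and it uses the standard library's
 xml.etree.ElementTree. The names local, pages_text and read_abbyy_gz are
 made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def pages_text(stream):
    """Yield the plain text of each <page>, one string per page."""
    lines, chars = [], []
    # Stream through the file: "end" events fire once an element is complete.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "charParams":
            # Assumption: an empty charParams element stands for a space.
            chars.append(elem.text or " ")
        elif name == "line":
            lines.append("".join(chars))
            chars = []
        elif name == "page":
            yield "\n".join(lines)
            lines = []
            elem.clear()  # drop the finished page to keep memory use low


def read_abbyy_gz(path):
    """Return a list with the plain text of every page in an _abbyy.gz file."""
    with gzip.open(path, "rb") as stream:
        return list(pages_text(stream))
```

 Calling read_abbyy_gz on a downloaded _abbyy.gz file should give one string
 per scanned page, which can then be matched by position against the pages
 of the djvu.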

 Alex

 2013/6/15 Alex Brollo <alex.brollo(a)gmail.com>

  I got it. o_O

 No need for regex, lxml, pyquery or XSLT... the simplest python parsing
 routines can understand abbyy xml and extract both the text and information
 about the text.

 The goal was to get, with python, both the plain text (the same produced by
 the wikisource server when creating a new page from a djvu text layer) and
 some html formatting, in a format usable by VisualEditor; and if you take a
 look at http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox, you'll
 see in red only the words where the wordPenalty parameter is more than 0 in
 the source abbyy xml file.
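
 The red-word idea can be sketched roughly as follows. This is an
 illustration under assumptions, not the routine behind the sandbox page: it
 assumes each character is a charParams element carrying an integer
 wordPenalty attribute, treats a word as suspicious if any of its characters
 has wordPenalty above 0, and uses xml.etree.ElementTree from the standard
 library. The helper names local and line_to_html are hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def line_to_html(line_elem):
    """Render one <line> element as HTML, with suspicious words in red."""
    out, word, flagged = [], [], False

    def flush():
        # Emit the word collected so far, wrapped in red if it was flagged.
        nonlocal word, flagged
        if word:
            text = html.escape("".join(word))
            if flagged:
                text = f'<span style="color:red">{text}</span>'
            out.append(text)
        word, flagged = [], False

    for elem in line_elem.iter():
        if local(elem.tag) != "charParams":
            continue
        ch = elem.text or " "  # assumption: empty element means a space
        if ch.isspace():
            flush()
            out.append(" ")
        else:
            word.append(ch)
            # Assumption: wordPenalty is an integer, missing means 0.
            if int(elem.get("wordPenalty", "0")) > 0:
                flagged = True
    flush()
    return "".join(out)
```

 Joining the output of line_to_html for every line of a page gives a small
 html fragment in which only the doubtful words stand out for the
 proofreader.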

 Alex brollo (from it.wikisource)

 2013/6/14 Alex Brollo <alex.brollo(a)gmail.com>

  IA gives abbyy xml files too (as .gz files); I opened one of them after
 a suggestion of Phe, and I'm dreaming about extracting anything useful to
 help proofreading. The only "small" problem is that I barely know what
 xml is, that it is similar to html in its (well-formed) structure, and that
 something called XSLT exists. :-(

 Is any of you working on abbyy xml files with a "little bit" more
 skill?

 Alex brollo

 _______________________________________________
 Wikisource-l mailing list
 Wikisource-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikisource-l
