Re: [Wikisource-l] ABBYY xml files: any of you is working about?

15 Jun 2013

I got it. o_O

No need for regex, lxml, pyquery or XSLT... the simplest python parsing
routines can understand abbyy xml and extract both the text and information
about the text.

The goal was to get, with python, both the plain text (the same produced by
the wikisource server when creating a new page from a djvu text layer) and
some html formatting, in a format usable by VisualEditor. If you take a
look at http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox, you'll see
that the only words in red are those where the wordPenalty parameter is
greater than 0 in the abbyy xml source file.
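
A minimal sketch of this idea follows; it is not the script actually used for the sandbox page. It relies on the standard library's xml.etree.ElementTree, and it assumes that each character sits in a charParams element carrying wordStart and wordPenalty attributes, grouped inside line elements. The function names, the sample data and the namespace in the sample are illustrative only.

```python
import xml.etree.ElementTree as ET
from html import escape


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def line_words(line):
    """Split one <line> element into (word, suspicious) pairs."""
    words, chars, suspicious = [], [], False
    for el in line.iter():
        if local(el.tag) != "charParams":
            continue
        ch = el.text or " "
        # Assumption: a word ends at whitespace or where wordStart="true".
        if ch.isspace() or el.get("wordStart") == "true":
            if chars:
                words.append(("".join(chars), suspicious))
            chars, suspicious = [], False
        if ch.isspace():
            continue
        chars.append(ch)
        # Assumption: wordPenalty is numeric; a missing attribute counts as 0.
        if float(el.get("wordPenalty", "0")) > 0:
            suspicious = True
    if chars:
        words.append(("".join(chars), suspicious))
    return words


def abbyy_to_text_and_html(xml_string):
    """Return (plain_text, html) with penalised words wrapped in a red span."""
    root = ET.fromstring(xml_string)
    text_lines, html_lines = [], []
    for el in root.iter():
        if local(el.tag) != "line":
            continue
        words = line_words(el)
        text_lines.append(" ".join(w for w, _ in words))
        html_lines.append(" ".join(
            '<span style="color:red">%s</span>' % escape(w) if bad else escape(w)
            for w, bad in words))
    return "\n".join(text_lines), "<br/>\n".join(html_lines)


# Hand-made sample, heavily simplified; real files carry many more attributes.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
<page><block blockType="Text"><text><par><line><formatting>
<charParams wordStart="true" wordPenalty="0">I</charParams>
<charParams wordStart="false" wordPenalty="0">n</charParams>
<charParams wordStart="false" wordPenalty="0"> </charParams>
<charParams wordStart="true" wordPenalty="36">t</charParams>
<charParams wordStart="false" wordPenalty="36">b</charParams>
<charParams wordStart="false" wordPenalty="36">e</charParams>
</formatting></line></par></text></block></page></document>"""

text, html = abbyy_to_text_and_html(SAMPLE)
print(text)
print(html)
```

For a real .gz file from IA, the xml string could be read with gzip.open(path, "rb").read() from the standard library, where path is whatever local file name was downloaded. For very large books, ET.iterparse would probably be a better fit than loading the whole tree at once.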

Alex brollo (from it.wikisource)

2013/6/14 Alex Brollo <alex.brollo(a)gmail.com>

...
 IA provides abbyy xml files too (as .gz files); I opened one of them at
 Phe's suggestion, and I'm dreaming of extracting anything useful to
 help proofreading. The only "small" problem is that I barely know what
 xml is, that it is similar to html in its (well-formed) structure, and that
 something called XSLT exists. :-(

 Is any of you working on abbyy xml files with a "little bit" more
 skill?

 Alex brollo
