I got it. o_O

No need for regex, lxml, pyquery or XSLT: the simplest Python parsing routines can read ABBYY XML and extract both the text and information about the text.

The goal was to get, with Python, both the plain text (the same text the wikisource server produces when creating a new page from a djvu text layer) and some HTML formatting, in a format usable by VisualEditor. If you take a look at http://it.wikipedia.org/wiki/Utente:Alex_brollo/Sandbox, you'll see that the only words in red are those whose wordPenalty parameter is greater than 0 in the source ABBYY XML file.
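
For anyone who wants to try the same idea, here is a rough sketch of the approach, not the actual script I used. It relies only on xml.etree.ElementTree from the standard library. It assumes that each character sits in a charParams element carrying wordStart and wordPenalty attributes, nested inside line elements; the tiny SAMPLE fragment is invented for illustration, and the helper names (local, words_in_line, extract) and the red span markup are hypothetical choices of this example.

```python
import html
import xml.etree.ElementTree as ET

# Invented fragment for illustration; real files are much larger and the
# element/attribute layout assumed here should be checked against them.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
<page><block blockType="Text"><text><par><line><formatting>
<charParams wordStart="true" wordPenalty="0">T</charParams>
<charParams wordStart="false" wordPenalty="0">h</charParams>
<charParams wordStart="false" wordPenalty="0">e</charParams>
<charParams wordStart="false" wordPenalty="0"> </charParams>
<charParams wordStart="true" wordPenalty="12">c</charParams>
<charParams wordStart="false" wordPenalty="12">a</charParams>
<charParams wordStart="false" wordPenalty="12">t</charParams>
</formatting></line></par></text></block></page>
</document>"""


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def words_in_line(line):
    """Yield (word, penalty) pairs for one <line> element."""
    chars, penalty = [], 0.0
    for el in line.iter():
        if local(el.tag) != "charParams":
            continue
        ch = el.text or ""
        starts_word = el.get("wordStart") == "true"
        # A space or a wordStart flag closes the word collected so far.
        if chars and (ch.isspace() or starts_word):
            yield "".join(chars), penalty
            chars, penalty = [], 0.0
        if ch and not ch.isspace():
            chars.append(ch)
            penalty = max(penalty, float(el.get("wordPenalty", "0")))
    if chars:
        yield "".join(chars), penalty


def extract(xml_text):
    """Return (plain_text, html_text); words with wordPenalty > 0 are red."""
    root = ET.fromstring(xml_text)
    plain_lines, html_lines = [], []
    for line in root.iter():
        if local(line.tag) != "line":
            continue
        words = list(words_in_line(line))
        plain_lines.append(" ".join(w for w, _ in words))
        html_lines.append(" ".join(
            f'<span style="color:red">{html.escape(w)}</span>' if p > 0
            else html.escape(w)
            for w, p in words))
    return "\n".join(plain_lines), "<br>\n".join(html_lines)


plain, marked = extract(SAMPLE)
print(plain)
print(marked)
```

Taking the highest wordPenalty among the characters of a word is just one possible convention; a real script would also have to deal with paragraphs, hyphenation at line ends and the other attributes in the file.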

Alex brollo (from it.wikisource)

2013/6/14 Alex Brollo <alex.brollo@gmail.com>

IA provides ABBYY XML files too (as .gz files). I opened one of them after a suggestion from Phe, and I'm dreaming about extracting anything useful to help proofreading. The only "small" problem is that I barely know what XML is, beyond that it is similar to HTML in its (well-formed) structure and that something called XSLT exists. :-(

Is any of you working with ABBYY XML files with a "little bit" more skill?

Alex brollo