Re: [Wikisource-l] The very first result of IA _abbyy.gz parsing & bot uploading into nsPage

16 Oct 2017

Thanks Alex!
I really hope this is a direction other developers will follow: being
able to harness the full potential of structured data from OCR software is
absolutely crucial for Wikisource.
We could actually automate *a lot* of the formatting work now done by
volunteers; their time would still be spent formatting, proofreading
and validating, but with much more power than before.
IMO, it changes a lot if a book is formatted ~50% by a machine: we could do
many more books in less time.
Go Alex!

Aubrey

On Mon, Oct 16, 2017 at 5:42 PM, Asaf Bartov <abartov(a)wikimedia.org> wrote:

...
  That's really promising!

 Thank you for sharing this.

    A.

 On Oct 17, 2017 00:11, "Alex Brollo" <alex.brollo(a)gmail.com> wrote:

  Here:
 Pagina:D'Ayala_-_Dizionario_militare_francese_italiano.djvu/46
 <https://it.wikisource.org/wiki/Pagina:D%27Ayala_-_Dizionario_militare_francese_italiano.djvu/46>
 and in the immediately preceding and following pages, both the text and
 some formatting come from the Internet Archive file
 bub_gb_lvzoCyRdzsoC_abbyy.gz
 <https://archive.org/download/bub_gb_lvzoCyRdzsoC/bub_gb_lvzoCyRdzsoC_abbyy.gz>
 (in the previous pages only some templates have been added and a little bit
 of regex manipulation has been done)

 Internet Archive _abbyy.gz files are gzipped, enormous XML files in which
 every detail of the FineReader OCR output is exported. Even though they are
 enormous and terribly complex, they can be parsed, and any detail can be
 used (a little bit painfully...). So far only bold, italic, smallcaps and
 paragraphs have been explored; they are translated into wiki code by a
 fairly simple Python script.

 Alex
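
 The conversion described in the message above can be sketched roughly as
 follows. This is not Alex's script, only a minimal illustration under stated
 assumptions: that the FineReader XML nests page > par > line > formatting >
 charParams, that the formatting element carries bold, italic and smallcaps
 attributes with the value "1" or "true", and that small caps map to a
 {{Sc|...}} template on the target wiki. The function names (local, is_on,
 wrap, pages_to_wikitext) are hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag, e.g. '{ns}par' -> 'par'."""
    return tag.rsplit("}", 1)[-1]


def is_on(value):
    # Assumption: boolean attributes are written as "1" or "true".
    return value in ("1", "true")


def wrap(text, attrs):
    """Wrap one run of text in wiki markup according to its formatting flags."""
    core = text.strip()
    if not core:
        return text
    # Keep surrounding whitespace outside the markup.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    if is_on(attrs.get("smallcaps")):
        core = "{{Sc|" + core + "}}"  # assumed template name
    if is_on(attrs.get("italic")):
        core = "''" + core + "''"
    if is_on(attrs.get("bold")):
        core = "'''" + core + "'''"
    return lead + core + trail


def pages_to_wikitext(stream):
    """Yield one wikitext string per <page> element, parsing incrementally."""
    paragraphs, lines = [], []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "line":
            runs = []
            for fmt in elem:
                if local(fmt.tag) != "formatting":
                    continue
                # Assumption: each character sits in its own <charParams>.
                chars = "".join(
                    c.text or "" for c in fmt if local(c.tag) == "charParams"
                )
                runs.append(wrap(chars, fmt.attrib))
            lines.append("".join(runs))
        elif name == "par":
            # Simplification: lines are joined with a space; words hyphenated
            # across lines are not rejoined.
            paragraphs.append(" ".join(lines))
            lines = []
        elif name == "page":
            yield "\n\n".join(p for p in paragraphs if p.strip())
            paragraphs = []
            elem.clear()  # free memory; these files are very large
```

 To try it on a downloaded file, something like
 pages_to_wikitext(gzip.open("bub_gb_lvzoCyRdzsoC_abbyy.gz")) should yield the
 pages one at a time; incremental parsing is used here because the files are
 too large to load comfortably as a whole tree.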

 _______________________________________________
 Wikisource-l mailing list
 Wikisource-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikisource-l

