Thanks Alex!

I really hope this is a direction other developers will follow: being able to harness the full potential of structured data from OCR software is absolutely crucial for Wikisource. We could actually automate *a lot* of the formatting work now done by volunteers, and their time could still be spent formatting, proofreading and validating, but with much more power than before.

IMO, it changes a lot if a book is formatted ~50% by a machine: we could do many more books in less time.

Go Alex!

Aubrey

On Mon, Oct 16, 2017 at 5:42 PM, Asaf Bartov <abartov@wikimedia.org> wrote:

That's really promising! Thank you for sharing this.

A.

On Oct 17, 2017 00:11, "Alex Brollo" <alex.brollo@gmail.com> wrote:

Here:

and in the immediately previous and following pages, both the text and some formatting come from the Internet Archive file bub_gb_lvzoCyRdzsoC_abbyy.gz (in the previous pages only some templates have been added and a little bit of regex manipulation has been done).

Internet Archive _abbyy.gz files are gzipped, enormous XML files in which every detail of the FineReader OCR output is exported. Even if enormous and terribly complex, they can be parsed, and any detail can (a little bit painfully...) be used. At present only bold, italic, smallcaps and paragraphs have been explored, and they are translated into wiki code by a pretty simple Python script.

Alex
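For readers who want to try the approach Alex describes, here is a minimal sketch of how bold, italic, smallcaps and paragraphs might be pulled out of an _abbyy.gz file and turned into wiki code. It is not Alex's script, which is not shown in this thread. It assumes the usual FineReader XML layout (page > par > line > formatting > charParams), that the style flags appear as attributes valued "1" or "true", and that small caps map to an {{sc|...}} template; the function names are made up for illustration, and template names differ between Wikisource projects.

```python
import gzip
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the XML namespace so the code does not depend on the schema version."""
    return tag.rsplit("}", 1)[-1]


def is_on(value):
    # Assumption: boolean style attributes are written as "1" or "true".
    return value in ("1", "true")


def wrap(text, fmt):
    """Wrap one formatting run in wiki markup, keeping outer whitespace outside it."""
    core = text.strip()
    if not core:
        return text
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    if is_on(fmt.get("smallcaps")):
        core = "{{sc|" + core + "}}"  # assumed template name
    if is_on(fmt.get("bold")):
        core = "'''" + core + "'''"
    if is_on(fmt.get("italic")):
        core = "''" + core + "''"
    return lead + core + trail


def par_to_wiki(par):
    """Join the lines of one <par> into a single wiki paragraph."""
    lines = []
    for line in par:
        if local(line.tag) != "line":
            continue
        runs = []
        for fmt in line:
            if local(fmt.tag) != "formatting":
                continue
            chars = "".join(
                cp.text or "" for cp in fmt if local(cp.tag) == "charParams"
            )
            runs.append(wrap(chars or (fmt.text or ""), fmt.attrib))
        lines.append("".join(runs).strip())
    return " ".join(l for l in lines if l)


def pages_to_wiki(stream):
    """Yield one wiki-text string per <page>, parsing incrementally to save memory."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) == "page":
            pars = [par_to_wiki(p) for p in elem.iter() if local(p.tag) == "par"]
            yield "\n\n".join(p for p in pars if p)
            elem.clear()  # free the finished page before reading the next one


def convert_file(path):
    """Convenience wrapper for a gzipped file such as bub_gb_lvzoCyRdzsoC_abbyy.gz."""
    with gzip.open(path, "rb") as f:
        yield from pages_to_wiki(f)
```

Calling convert_file on a downloaded _abbyy.gz file would yield the pages in order. Markup is applied run by run, so an italic passage that spans a line break comes out as two adjacent italic runs; a regex pass afterwards (like the one Alex mentions) could merge them.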
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l