Hi,
Great! Some questions below to better understand what's wanted:
On Sun, 2018-12-02 at 15:22 +0530, Shiju Alex wrote:
> Recently Tuebingen University
> <https://uni-tuebingen.de/en/university.html> (with
> the support of the German Research Foundation) ran a project titled the
> *Gundert Legacy project* to digitize close to 137,000 pages from *850
> public domain books*.
> All these public domain books are in the South Indian languages
> *Malayalam, Kannada, Tamil, Tulu, and Telugu*. Of these, 293 books are
> in Malayalam, 187 in Kannada, 25 in Tamil, and 4 in Telugu and Tulu.
> There was also a separate sub-project, run as part of this project, to
> convert 136 Malayalam titles to Malayalam Unicode. The number of pages
> converted to Unicode is close to *25,700*. The Unicode conversion
> project was run only for Malayalam; for the other languages it is just
> the scanning of the books.
What does "converted to Unicode" mean? Converted from what exactly? Do
you maybe mean "converted via OCR (optical character recognition) from
image file formats (JPG, PNG, images in a PDF) which don't allow
selecting text, to a file format which does allow selecting text"?
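(Side note: by "Unicode text" I'd understand pages existing as actual
character data rather than pixels. Purely as an illustration — this is a
hypothetical sketch, not an existing tool — a script could check whether a
string really is Malayalam-script Unicode by looking at the Malayalam code
block, U+0D00 to U+0D7F:)

```python
def is_malayalam(text: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the non-space characters
    fall in the Malayalam Unicode block (U+0D00 - U+0D7F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    malayalam = sum(1 for c in chars if '\u0d00' <= c <= '\u0d7f')
    return malayalam / len(chars) >= threshold

print(is_malayalam("മലയാളം"))           # Malayalam script -> True
print(is_malayalam("just Latin text"))  # -> False
```

A scanned page image would of course fail any such check, since it
contains no character data at all.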
> The project is complete now and the results of the project are
> available in the Hermann Gundert Portal,
> https://www.gundert-portal.de/?language=en, which
> was released on Nov 20. A news report is available here:
> <https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mission-all-set-to-get-a-digital-makeover/articleshow/66633108.cms>
> To view the books in each language you can navigate through the various
> links in the portal. For example, Malayalam books are available here:
> https://www.gundert-portal.de/?page=malayalam
> Now we need to upload these scans to Wikimedia Commons and the Unicode
> text to Malayalam Wikisource (25,700 Unicode-converted pages).
> The first priority is the scans that were converted to Unicode. Is it
> possible to write a script to migrate the scans from the Tuebingen
> Digital Library to Wikimedia Commons? (I can share the exact details of
> the books converted to Unicode if needed.)
What would you want the script to do exactly? Pull the files from the
Tuebingen Digital Library and then mass-upload these files to Commons?
OCR (identify letters in pure images and convert those letters to text
which can be selected and copied)? Something else?
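If it's the first option: pywikibot is the usual tool for mass uploads to
Commons (its UploadRobot can fetch files from URLs). As a rough sketch of
the preparation step only — every title, URL, and piece of wording below is
a placeholder, not the real Gundert metadata — a script could first derive
Commons file names and {{Information}} description pages, then hand those
to the upload tool:

```python
# Sketch: build Commons file names and {{Information}} descriptions for
# each scan. A tool like pywikibot's UploadRobot (or the MediaWiki upload
# API) would then push them to Commons; that part is omitted here.

def commons_filename(title: str, language: str) -> str:
    """Derive a Commons-safe file name; Commons forbids e.g. '/' and ':'."""
    safe = title.replace("/", "-").replace(":", " -")
    return safe + " (" + language + ", Gundert Legacy).pdf"

def file_description(title: str, language: str, source_url: str) -> str:
    """Wikitext {{Information}} block for the file description page."""
    return (
        "{{Information\n"
        "|description={{en|1=" + title + ", a public-domain " + language
        + " book digitized by the Gundert Legacy project}}\n"
        "|source=" + source_url + "\n"
        "|date=2018\n"
        "}}"
    )

# Placeholder example entry, not a real book:
print(commons_filename("Some Title", "Malayalam"))
print(file_description("Some Title", "Malayalam",
                       "https://www.gundert-portal.de/"))
```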
To convert image files available on Wikimedia Commons to recognized
text, see for example https://tools.wmflabs.org/ws-google-ocr/ . There
is also https://phabricator.wikimedia.org/T120788 for more info/tools.
> All the digitized files are heavy, and their sizes range from 100 MB
> to 1.5 GB depending on the number of pages in the books. So manually
> managing this is going to be a big challenge.
> Can someone help with this?
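On the file sizes: the MediaWiki API supports chunked uploads
(action=upload with chunk/offset/filesize parameters), where a large file
is stashed in pieces rather than sent in one request, so a script would
not need to hold or transfer each file in one go. Purely as a sketch of
the client-side splitting step — the chunk size below is a placeholder,
and the actual API calls and authentication are omitted:

```python
# Sketch of the client-side splitting step for MediaWiki chunked uploads.
# This only reads the file in pieces; each (offset, chunk) pair would
# become one action=upload request, with a final request assembling the
# stashed pieces into the complete file on the server.

def iter_chunks(path, chunk_size=5 * 1024 * 1024):
    """Yield (offset, bytes) pieces of the file at `path`."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield offset, chunk
            offset += len(chunk)
```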
Cheers,
andre
--
Andre Klapper | Bugwrangler / Developer Advocate
https://blogs.gnome.org/aklapper/