Book scans from Tuebingen Digital Library to Wikimedia Commons


Shiju Alex

2 Dec 2018

9:52 a.m.

Hello,

Recently Tuebingen University <https://uni-tuebingen.de/en/university.html> (with support from the German Research Foundation) ran a project titled *Gundert Legacy project* to digitize close to 137,000 pages from *850 public domain books*.

All these public domain books are in the South Indian languages *Malayalam, Kannada, Tamil, Tulu, and Telugu*. Of these, 293 books are in Malayalam, 187 in Kannada, 25 in Tamil, 4 in Telugu and Tulu.

A separate sub-project, run as part of this project, converted 136 Malayalam titles to Malayalam Unicode. Close to *25,700* pages were converted to Unicode. The Unicode conversion was run only for Malayalam; for the other languages, the books were only scanned.

The project is now complete and its results are available in the Hermann Gundert Portal <https://www.gundert-portal.de/?language=en>, which was released on Nov 20. A news report is available here: <https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mission-all-set-to-get-a-digital-makeover/articleshow/66633108.cms>

To view the books in each language you can navigate through the various links in the portal. For example, Malayalam books are available here: https://www.gundert-portal.de/?page=malayalam

Now we need to upload these scans to Wikimedia Commons and the Unicode text (25,700 converted pages) to Malayalam Wikisource.

The first priority is the scans that have been converted to Unicode. Is it possible to write a script to migrate the scans from the Tuebingen Digital Library to Wikimedia Commons? (I can share the exact details of the books converted to Unicode if needed.)

All the digitized files are heavy: sizes range from 100 MB to 1.5 GB depending on the number of pages in the book, so managing this manually is going to be a big challenge.

Can someone help with this?

Shiju Alex
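The priority described in this message (books that already have a Unicode transcript go first) could be expressed as a simple ordering over a manifest. The following is only a sketch: the `Book` record, its field names and the example identifiers are hypothetical and do not come from the Tuebingen catalogue, and the "smaller files first" tie-break is an extra assumption, not something requested in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    identifier: str    # hypothetical catalogue ID, not a real Tuebingen identifier
    language: str
    size_mb: int       # size of the scan file in megabytes
    transcribed: bool  # True if a proofread Unicode transcript exists


def upload_order(books):
    """Return books with transcribed titles first, smaller files first within each group."""
    # False sorts before True, so "not transcribed" puts transcribed books at the front.
    return sorted(books, key=lambda b: (not b.transcribed, b.size_mb))


# Made-up manifest entries, purely for illustration.
books = [
    Book("example-kannada-1", "Kannada", 400, False),
    Book("example-malayalam-1", "Malayalam", 1500, True),
    Book("example-malayalam-2", "Malayalam", 100, True),
]

for book in upload_order(books):
    print(book.identifier, book.language, book.size_mb)
```

Keeping the ordering in one function means the real list of 136 transcribed titles can be plugged in later without touching the rest of a migration script.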


Andre Klapper

2 Dec

10:37 a.m.


Hi,

Great! Some questions below to better understand what's wanted:

On Sun, 2018-12-02 at 15:22 +0530, Shiju Alex wrote:

> Also there was a separate sub-project which was run as part of this project to convert 136 titles in Malayalam to Malayalam Unicode. The number of pages that were converted to Unicode is close to *25,700* pages. The Unicode conversion project was run only for Malayalam. For the other languages it is just the scanning of books.

What does "converted to Unicode" mean? Converted from what exactly? Do you maybe mean "converted via OCR (optical character recognition) from images in file formats (JPG, PNG, images in a PDF) which don't allow marking text, to a file format which allows marking text in those files"?

> Now we need to upload these scans to Wikimedia Commons and Unicode text to Malayalam Wikisource (25,700 Unicode converted pages). The first priority is for the scans that are converted to Unicode. Is it possible to write a script to migrate the scans from Tuebingen Digital library to Wikimedia Commons? (I can share the exact details of books converted to Unicode if needed)

What would you want the script to do exactly? Pull the files from the Tuebingen Digital Library and then mass-upload these files to Commons? OCR (identifying letters in pure images and converting those letters to text which can be marked and copied)? Something else?

To convert image files available on Wikimedia Commons to recognized text, see https://tools.wmflabs.org/ws-google-ocr/ for example. There is also https://phabricator.wikimedia.org/T120788 for more info/tools.

Cheers,
andre
--
Andre Klapper | Bugwrangler / Developer Advocate
https://blogs.gnome.org/aklapper/

Shiju Alex

10:54 a.m.


Hi,

Here are the answers.

> What does "converted to Unicode" mean? Converted from what exactly? Do you maybe mean "converted via OCR (Optical character recognition) from images in file formats (JPG, PNG, images in a PDF) which don't allow marking text to a file format which allows marking text in those files?

There is no good OCR for languages like Malayalam, so each scanned image was manually typed and proofread. For example, see the 7th page of this book <http://idb.ub.uni-tuebingen.de/opendigi/CiXIV125b#p=7&tab=transcript>. You can see the scan image on the right and the transcribed text for that page on the left in the *Transcript* tab. This was done for 136 books, and the total number of pages in these books is close to 25,700.

> What would you want the script to do exactly? Pull the files from the Tuebingen Digital Library and then mass-upload these files to Commons?

Yes, this is what is required. We will handle the Unicode migration separately.

Shiju Alex
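The "pull the files" half of such a script mostly needs to avoid loading 100 MB to 1.5 GB files into memory. A minimal sketch using only the Python standard library is given here; the function names and the 1 MiB chunk size are illustrative choices, and the thread gives no download URL pattern for the Tuebingen Digital Library, so any URL passed to `download` is left to the caller.

```python
import io
import os
import shutil
import tempfile
import urllib.request

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice


def stream_to_file(source, dest_path, chunk_size=CHUNK_SIZE):
    """Copy a binary file-like object to dest_path in chunks; return the bytes written."""
    with open(dest_path, "wb") as dest:
        shutil.copyfileobj(source, dest, chunk_size)
        return dest.tell()


def download(url, dest_path):
    """Fetch url and stream the response body to dest_path without buffering it all."""
    with urllib.request.urlopen(url) as response:
        return stream_to_file(response, dest_path)


# Offline demonstration: an in-memory buffer stands in for the HTTP response.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "scan.pdf")
    written = stream_to_file(io.BytesIO(b"x" * 3_000_000), target)
    print(written, os.path.getsize(target))
```

Splitting the copy step from the network call keeps the chunking logic testable without touching the library's servers.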


Ryan Kaldari

3 Dec

4:31 a.m.


> There is no good OCR for languages like Malayalam.

Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision API (which is usable from a Wikisource gadget <https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson) will do OCR on Tamil. I can't vouch for these being "good", but they do exist.


Shiju Alex

5:21 a.m.


The request in this post is not to create OCR for any language or script, but to migrate certain public domain book scans from the Tuebingen Digital Library to Wikimedia Commons. There is also another task of migrating the *already proofread Unicode text* to Wikisource, but to take up the Unicode migration the scans first need to be in Commons.

I am making this request only because of the huge number of pages that we need to handle. If it were just a few hundred pages, volunteers would have done it manually.

Shiju


bawolff

8:25 a.m.


Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading ? I think the folks at Commons are more likely to be able to give you the help you need than wikitech-l would be.

--
Brian
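Whichever batch-upload route is chosen, files of 100 MB to 1.5 GB are typically sent to a wiki in pieces rather than in one request. As far as the MediaWiki upload API documentation describes it, `action=upload` accepts chunked uploads through its `stash`, `offset`, `filesize`, `filekey` and `chunk` parameters. The sketch below only computes the (offset, length) pairs such an upload would step through; it makes no network calls, and the 5 MiB chunk size is an arbitrary assumption rather than a documented limit.

```python
def chunk_plan(file_size, chunk_size):
    """Return (offset, length) pairs that together cover file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


FIVE_MIB = 5 * 1024 * 1024    # assumed chunk size, chosen only for illustration
example_size = 1_500_000_000  # roughly the 1.5 GB upper bound mentioned in the thread

plan = chunk_plan(example_size, FIVE_MIB)
print(len(plan), "chunks; last chunk:", plan[-1])
```

Each pair would correspond to one request carrying that slice of the file, with the `filekey` returned by the first request passed along to the following ones.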


Shiju Alex

4:06 p.m.


> Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading ? I think the folks at commons are more likely to be able to give you the help you need than wikitech-l would be.

Thank you. I was not aware of this option. Let me try this.

Shiju Alex


Shrinivasan T

5 Dec

9:58 a.m.


We used this script https://github.com/tshrinivasan/tools-for-wiki/tree/master/pdf-upload-commo… to upload some 2000 public domain Tamil books to Commons.

Explore batch uploading to Commons first. If it is not suitable for you, I can help customize this script.

Regards,
T. Shrinivasan
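Any upload script of this kind also has to generate a file description page for each book. The sketch below is not taken from the script linked in this message; it is a separate illustration that fills in the `{{Information}}` template used on Commons. The function name, the example metadata and the source URL are hypothetical, and the license tag shown is only a placeholder: the tag that actually applies has to be decided for each book.

```python
def escape_pipes(value):
    """Replace '|' with the {{!}} magic word so it cannot split a template parameter."""
    return value.replace("|", "{{!}}")


def description_page(title, author, year, source_url, license_tag):
    """Build wikitext for a Commons file description page using {{Information}}."""
    return "\n".join([
        "== {{int:filedesc}} ==",
        "{{Information",
        f"|description={escape_pipes(title)}",
        f"|date={year}",
        f"|source={source_url}",
        f"|author={escape_pipes(author)}",
        "}}",
        "",
        "== {{int:license-header}} ==",
        license_tag,
    ])


# Hypothetical metadata; the license tag is a placeholder, not a recommendation.
print(description_page(
    title="Example title",
    author="Example author",
    year="1850",
    source_url="https://example.org/scan",
    license_tag="{{PD-old-100}}",
))
```

Generating the page text from a per-book metadata record keeps descriptions consistent across hundreds of uploads and makes it easy to correct a field and regenerate.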


wikitech-l@lists.wikimedia.org
