Thank you. I was not aware of this option. Let me try this.
Shiju Alex
On Mon, Dec 3, 2018 at 1:55 PM bawolff <bawolff+wn(a)gmail.com> wrote:
Have you seen
https://commons.wikimedia.org/wiki/Commons:Batch_uploading
? I think the folks at commons are more likely to be able to give you
the help you need than wikitech-l would be.
--
Brian
On Mon, Dec 3, 2018 at 5:22 AM Shiju Alex <shijualexonline(a)gmail.com>
wrote:
> Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
> API (which is usable from a Wikisource gadget
> <https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
> will do OCR on Tamil. I can't vouch for these being "good", but they do
> exist.

The request in this post is not to create OCR for any language or
script, but to migrate certain public domain book scans from the
Tuebingen digital library to Wikimedia Commons.

There is also a second task: migrating the *already proofread Unicode
text* to Wikisource. But before the Unicode migration can be taken up,
the scans need to be in Commons.

I am making this request only because of the huge number of pages that
we need to handle. If it were just a few hundred pages, volunteers would
have done it manually.

Shiju
On Mon, Dec 3, 2018 at 10:01 AM Ryan Kaldari <rkaldari(a)wikimedia.org>
wrote:
> >There is no good OCR for languages like Malayalam.
>
> Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
> API (which is usable from a Wikisource gadget
> <https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
> will do OCR on Tamil. I can't vouch for these being "good", but they do
> exist.
>
> On Sun, Dec 2, 2018 at 2:54 AM Shiju Alex <shijualexonline(a)gmail.com>
> wrote:
>
> > Hi
> >
> > Here are the answers
> >
> > > What does "converted to Unicode" mean? Converted from what exactly? Do
> > > you maybe mean "converted via OCR (Optical character recognition) from
> > > images in file formats (JPG, PNG, images in a PDF) which don't allow
> > > marking text to a file format which allows marking text in those files?
> >
> >
> > There is no good OCR for languages like Malayalam. So each scanned image
> > is manually typed and proofread. For example, see the 7th page of this book
> > <http://idb.ub.uni-tuebingen.de/opendigi/CiXIV125b#p=7&tab=transcript>.
> > You can see the scan image on the right and the transcribed text for that
> > page on the left in the *Transcript* tab. This was done for 136 books, and
> > the total number of pages in these books is close to 25,700.
> >
> > > What would you want the script to do exactly? Pull the files from the
> > > Tuebingen Digital Library and then mass-upload these files to Commons?
> >
> > Yes, this is what is required. We will handle the Unicode migration
> > separately.
>
>
> Shiju Alex
>
> > On Sun, Dec 2, 2018 at 4:07 PM Andre Klapper <aklapper(a)wikimedia.org>
> > wrote:
> >
> > > Hi,
> > >
> > > Great! Some questions below for better understanding what's wanted:
> > >
> > > On Sun, 2018-12-02 at 15:22 +0530, Shiju Alex wrote:
> > > > Recently Tuebingen University
> > > > <https://uni-tuebingen.de/en/university.html> (with the support
> > > > from German Research Foundation) ran a project titled *Gundert
> > > > Legacy project* to digitize close to 137,000 pages from *850 public
> > > > domain books*.
> > > >
> > > > All these public domain books are in the South Indian languages
> > > > *Malayalam, Kannada, Tamil, Tulu, and Telugu*. Of these, 293 books
> > > > are in Malayalam, 187 in Kannada, 25 in Tamil, 4 in Telugu and Tulu.
> > > >
> > > > Also, a separate sub-project was run as part of this project to
> > > > convert 136 titles in Malayalam to Malayalam Unicode. The number of
> > > > pages that were converted to Unicode is close to *25,700*. The
> > > > Unicode conversion project was run only for Malayalam. For the other
> > > > languages it is just the scanning of books.
> > >
> > > What does "converted to Unicode" mean? Converted from what exactly? Do
> > > you maybe mean "converted via OCR (Optical character recognition) from
> > > images in file formats (JPG, PNG, images in a PDF) which don't allow
> > > marking text to a file format which allows marking text in those files?
> > >
> > > > The project is complete now and the results of the project are
> > > > available in the Hermann Gundert Portal
> > > > <https://www.gundert-portal.de/?language=en>, which was released on
> > > > Nov 20. A news report is available here
> > > > <https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mi…>.
> > > >
> > > > To view the books in each language you can navigate through the
> > > > various links in the portal. For example, Malayalam books are
> > > > available here: https://www.gundert-portal.de/?page=malayalam
> > > >
> > > > Now we need to upload these scans to Wikimedia Commons and the
> > > > Unicode text to Malayalam Wikisource (25,700 Unicode converted pages).
> > > >
> > > > The first priority is for the scans that are converted to Unicode.
> > > > Is it possible to write a script to migrate the scans from Tuebingen
> > > > Digital library to Wikimedia Commons? (I can share the exact details
> > > > of books converted to Unicode if needed)
> > >
> > > What would you want the script to do exactly? Pull the files from the
> > > Tuebingen Digital Library and then mass-upload these files to Commons?
> > > OCR (identify letters in pure images and converting those letters to
> > > text which could be marked and copied)? Something else?
> > >
> > > To convert image files available on Wikimedia Commons to recognized
> > > text, see https://tools.wmflabs.org/ws-google-ocr/ for example.
There
info/tools.
> > >
> > > > All the digitized files are heavy and the size ranges from 100 MB to
> > > > 1.5 GB depending on the number of pages in the books. So manually
> > > > managing this is going to be a big challenge.
> > > >
> > > > Can someone help with this?
>
> Cheers,
> andre
> --
> Andre Klapper | Bugwrangler / Developer Advocate
> https://blogs.gnome.org/aklapper/
>
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l