Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

5 Jan 2016

Great, thank you for the news and congratulations on this achievement. :)

On 05/01/2016 19:29, Bodhisattwa Mandal wrote:
...
  Hi,

 I am happy to report that Shrinivasan has created a Python script to
 automate the process on Linux systems. The script uploads the PDF
 files to Google Drive, downloads the OCRed text, and splits and merges
 the text files so that they correspond to the PDF file. We have just
 tested the script on small files on Kannada and Bengali Wikisource, and
 it was successful. We are going to test the script with different types
 and sizes of files, and in other Indic languages, in the next few days.

 The script is in https://github.com/tshrinivasan/OCR4wikisource
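
As a rough illustration of the merge step described above (this is not taken from OCR4wikisource; the `name_N.txt` file naming scheme and the function names `page_number` and `merge_pages` are assumptions made for this sketch), per-page OCR text could be joined in page order roughly like this:

```python
import re


def page_number(name: str) -> int:
    """Extract the trailing page number from a name like 'book_12.txt'."""
    match = re.search(r"(\d+)\.txt$", name)
    if match is None:
        raise ValueError(f"no page number in {name!r}")
    return int(match.group(1))


def merge_pages(texts: dict[str, str]) -> str:
    """Join per-page OCR texts in numeric (not alphabetical) page order."""
    # Sorting on the numeric suffix keeps page 2 ahead of page 10.
    ordered = sorted(texts, key=page_number)
    return "\n".join(texts[name].strip() for name in ordered)


if __name__ == "__main__":
    # Hypothetical per-page OCR output, keyed by file name.
    pages = {"book_10.txt": "tenth", "book_2.txt": "second", "book_1.txt": "first"}
    print(merge_pages(pages))
```

Sorting on the numeric suffix rather than on the raw file name matters here, because a plain alphabetical sort would place page 10 before page 2.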

 Regards,
 Bodhisattwa

 On 2 December 2015 at 17:21, Tobias Schönberg <tobias47n9e(a)gmail.com>
 wrote:

     I think it is important for non-technical readers of this list to
     keep the two issues in this discussion separate.

     1) OCR-Integration
     This is something WMF can help with, because they can make the
     connection between an OCR service and Mediawiki easier and
     automate certain steps.

     2) OCR
     WMF is not writing OCR software, and it would probably be a bad
     idea to reinvent the wheel. It would be far better if editors
     reached out to existing OCR software projects. Starting a
     discussion or filing a bug is an important first step in improving
     the situation.
     Tesseract-OCR (https://github.com/tesseract-ocr), for example, is
     an open-source project that works on OCR (no bugs have been filed
     for e.g. Bengali). The mailing list
     (https://groups.google.com/forum/#!forum/tesseract-ocr)
     contains discussions about e.g. Bengali
     (https://groups.google.com/forum/#!searchin/tesseract-ocr/Bengali).
     So I think the situation might not be good, but it is certainly on
     its way to getting better.
     Maybe WMF-India can fund a developer to work on Tesseract-OCR.
     Another idea would be to reach out to local universities. Maybe a
     few informatics students can improve the situation.

     -Tobias

     2015-12-01 19:51 GMT+01:00 ViswaPrabha (വിശ്വപ്രഭ)
     <viswaprabha(a)gmail.com>:

         From the page that Alex linked:
         "On the other hand, using the service for converting document
         formats /is/ SaaSS, because it's something you could have done
         by running a suitable program (free, one hopes) in your own
         computer."

         Hundreds of us have burnt our fingers trying to develop a
         successful 'free' OCR tool for Indic languages, without any
         real luck so far.
         Until such a tool appears on the horizon, the Google facility
         is acceptable to use.

         This is especially so because we are dealing with 'free' input
         and output material anyway.

         -Viswaprabha

         On 1 December 2015 at 21:49, Bodhisattwa Mandal
         <bodhisattwa.rgkmc(a)gmail.com> wrote:

             Hi Alex,

             Of course, building a free OCR tool is the only permanent
             solution, but WMF is not interested in building new OCR
             software right now. The language engineering team said at
             the conference that they don't have the infrastructure and
             expertise to build such software. That's why we have to
             rely on Google OCR, knowing very well about its
             profit-making intentions. It's just a temporary solution,
             but right now it's the best available alternative for us.

             Regards
             Bodhisattwa

             On 1 Dec 2015 21:12, "Alex Brollo" <alex.brollo(a)gmail.com>
             wrote:

                 ... nevertheless I found this about "SaaSS" very
                 interesting:
                 https://www.gnu.org/philosophy/who-does-that-server-really-serve.html

                 So, building a true, excellent and independent
                 "wikisource multilingual OCR service" would be a
                 better solution.

                 Alex

                 2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal
                 <bodhisattwa.rgkmc(a)gmail.com>:

                     Hi Nemo,

                     Thanks for your interest. You can find the list of
                     languages supported by Google OCR at the following link:

                     https://support.google.com/drive/answer/176692?hl=en

                     Regards,
                     Bodhisattwa

                     Thanks for posting about the topic. Which Indic
                     languages are we talking about exactly? Are they
                     included in the recent FineReader versions now
                     used by the Internet Archive?

                     Nemo

                     _______________________________________________
                     Wikisource-l mailing list
                     Wikisource-l(a)lists.wikimedia.org
                     <mailto:Wikisource-l@lists.wikimedia.org>
                     https://lists.wikimedia.org/mailman/listinfo/wikisource-l

 -- 
 Bodhisattwa Mandal
 Administrator, Bengali Wikipedia

 ''Imagine a world in which every single person on the planet is given 
 free access to the sum of all human knowledge.''
