Re: [Wikisource-l] Vote for Google OCR-Wikisource integration in 2015 community wishlist

5 Jan 2016

Hi,

I am happy to report that Shrinivasan has created a Python script to
automate the process on Linux systems. The script uploads the PDF files to
Google Drive, downloads the OCRed text, and splits and merges the text
files so that they match the PDF file. We have just tested the script on
small files on Kannada and Bengali Wikisource, and it was successful. We
are going to test the script with files of different types and sizes, and
in other Indic languages, over the next few days.

The script is at https://github.com/tshrinivasan/OCR4wikisource
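
For anyone curious about the merge step, here is a minimal sketch. It is
not taken from the OCR4wikisource repository. It assumes the OCRed text has
already been downloaded as one UTF-8 file per page, with hypothetical names
such as page_1.txt, page_2.txt, and so on. The function names page_number
and merge_pages are illustrative only.

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Return the first run of digits in the file name, e.g. 12 for 'page_12.txt'."""
    match = re.search(r"(\d+)", path.stem)
    if match is None:
        raise ValueError(f"no page number in {path.name}")
    return int(match.group(1))


def merge_pages(text_dir: Path, output: Path) -> int:
    """Concatenate per-page text files in numeric page order.

    Assumption: `output` lives outside `text_dir`, so a rerun does not
    pick up the merged file as if it were a page.
    Returns the number of pages merged.
    """
    # Sort numerically so that page_10.txt comes after page_2.txt.
    pages = sorted(text_dir.glob("*.txt"), key=page_number)
    with output.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8").rstrip("\n"))
            out.write("\n\f")  # a form feed marks the page break
    return len(pages)
```

Sorting on the numeric part of the name, rather than on the name itself,
keeps the pages in the same order as the PDF even when the file names are
not zero-padded.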

Regards,
Bodhisattwa

On 2 December 2015 at 17:21, Tobias Schönberg <tobias47n9e(a)gmail.com> wrote:

...
  I think it is important for non-technical readers of this list to separate
 the 2 issues in the discussion.

 1) OCR-Integration
 This is something WMF can help with, because they can make the connection
 between an OCR service and Mediawiki easier and automate certain steps.

 2) OCR
 WMF is not developing OCR software, and it would probably be a bad idea
 to reinvent the wheel. It would be far better if editors reached out to
 existing OCR software projects. Starting a discussion or filing a bug is an
 important first step in improving the situation.
 Tesseract-OCR (https://github.com/tesseract-ocr), for example, is an
 open-source project that works on OCR (no bugs filed for e.g. Bengali). The
 mailing list (https://groups.google.com/forum/#!forum/tesseract-ocr)
 contains discussions about e.g. Bengali (
 https://groups.google.com/forum/#!searchin/tesseract-ocr/Bengali). So I
 think the situation might not be good, but it is certainly on its way to
 getting better.
 Maybe WMF-India can fund a developer to work on Tesseract-OCR. Another
 idea would be to reach out to local universities. Maybe a few
 informatics students can improve the situation.

 -Tobias

 2015-12-01 19:51 GMT+01:00 ViswaPrabha (വിശ്വപ്രഭ) <viswaprabha(a)gmail.com>:

  From the page that Alex linked:
 "On the other hand, using the service for converting document formats
 *is* SaaSS, because it's something you could have done by running a
 suitable program (free, one hopes) in your own computer."

 Hundreds among us have burnt our fingers trying to develop a successful
 'free' OCR tool for Indic languages, without any real luck so far.
 Until such a tool appears on the horizon, it is acceptable to use the
 Google facility.

 Especially so, because we are anyway dealing with 'free' input and output
 material.

 -Viswaprabha

 On 1 December 2015 at 21:49, Bodhisattwa Mandal <
 bodhisattwa.rgkmc(a)gmail.com> wrote:

  Hi Alex,

 Of course, building a free OCR tool is the only permanent solution, but WMF
 is not interested in building a new OCR tool right now. The language
 engineering team said at the conference that they don't have the
 infrastructure and expertise to build such software. That's why we have to
 rely on Google OCR, knowing very well about its profit-making intentions.
 It's just a temporary solution, but right now it's the best available
 alternative for us.

 Regards
 Bodhisattwa
 On 1 Dec 2015 21:12, "Alex Brollo" <alex.brollo(a)gmail.com> wrote:

  ... nevertheless I found this about "SaaSS" very interesting:
 https://www.gnu.org/philosophy/who-does-that-server-really-serve.html

 So, building a true, excellent and independent "wikisource multilingual
 OCR service" would be a better solution.

 Alex

 2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal <
 bodhisattwa.rgkmc(a)gmail.com>:

> Hi Nemo,
>
> Thanks for your interest. You can find the list of Google OCR
> supported languages in the following link -
>
> https://support.google.com/drive/answer/176692?hl=en
>
> Regards,
> Bodhisattwa
>
> Thanks for posting about the topic. Which Indic languages are we
> talking about exactly? Are they included in the recent FineReader versions
> now used by Internet Archive?
>
> Nemo
>
> _______________________________________________
> Wikisource-l mailing list
> Wikisource-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l

-- 
Bodhisattwa Mandal
Administrator, Bengali Wikipedia

''Imagine a world in which every single person on the planet is given free
access to the sum of all human knowledge.''
