Great, thank you for the news and congratulations on this achievement. :)
On 05/01/2016 19:29, Bodhisattwa Mandal wrote:
Hi,
I am happy to report that Shrinivasan has created a Python script to
automate the process on Linux systems. The script uploads the PDF
files to Google Drive, downloads the OCRed text, and splits and merges
the text files so that they correspond to the PDF file. We have just
tested the script on small files from Kannada and Bengali Wikisource
and it was successful. We are going to test the script with different
types and sizes of files, and in other Indic languages, in the next
few days.
The script is available at
https://github.com/tshrinivasan/OCR4wikisource
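For readers wondering what the merge step involves, here is a minimal
sketch in Python. It is not taken from Shrinivasan's script: the file
naming scheme (the page number being the last run of digits in each
filename) and the function names page_number and merge_pages are
assumptions made only for illustration. It shows one way to join
per-page OCR text in numeric page order, so that page 10 does not sort
before page 2.

```python
import re


def page_number(filename):
    """Return the last run of digits in a filename as an integer.

    Assumes a hypothetical naming scheme such as 'book_page_12.txt'.
    """
    digits = re.findall(r"\d+", filename)
    if not digits:
        raise ValueError(f"no page number in {filename!r}")
    return int(digits[-1])


def merge_pages(pages, separator="\n\n"):
    """Join per-page OCR text in numeric page order.

    `pages` maps a per-page filename (hypothetical) to its OCR text.
    """
    ordered = sorted(pages, key=page_number)
    return separator.join(pages[name].strip() for name in ordered)


if __name__ == "__main__":
    sample = {
        "book_page_10.txt": "text of page ten",
        "book_page_2.txt": "text of page two",
        "book_page_1.txt": "text of page one",
    }
    print(merge_pages(sample))
```

A plain alphabetical sort would put "book_page_10.txt" before
"book_page_2.txt", which is why the sketch sorts on the extracted
number instead of the filename.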
Regards,
Bodhisattwa
On 2 December 2015 at 17:21, Tobias Schönberg
<tobias47n9e(a)gmail.com> wrote:
I think it is important for non-technical readers of this list to
separate the 2 issues in the discussion.
1) OCR-Integration
This is something WMF can help with, because they can make the
connection between an OCR service and Mediawiki easier and
automate certain steps.
2) OCR
WMF is not developing OCR software, and it would probably be a
bad idea to reinvent the wheel. It would be far better if editors
reached out to existing OCR software projects. Starting a
discussion or filing a bug is an important first step in improving
the situation.
Tesseract-OCR (https://github.com/tesseract-ocr), for example, is an
open-source project that works on OCR (no bugs filed for e.g.
Bengali). The mailing list
(https://groups.google.com/forum/#!forum/tesseract-ocr)
contains discussions about e.g. Bengali
(https://groups.google.com/forum/#!searchin/tesseract-ocr/Bengali).
So I think the situation might not be good, but it is certainly on
its way to getting better.
Maybe WMF-India can fund a developer to work on Tesseract-OCR.
Another idea would be to reach out to local universities. Maybe a
few informatics students can improve the situation.
-Tobias
2015-12-01 19:51 GMT+01:00 ViswaPrabha (വിശ്വപ്രഭ)
<viswaprabha(a)gmail.com>:
From the page that Alex linked:
"On the other hand, using the service for converting document
formats /is/ SaaSS, because it's something you could have done
by running a suitable program (free, one hopes) in your own
computer."
Hundreds among us have burnt their hands trying to develop a
successful 'free' OCR tool for Indic languages, without any
real luck so far.
Until such a tool appears on the horizon, the Google facility
is acceptable to use.
Especially so, because we are anyway dealing with 'free' input
and output material.
-Viswaprabha
On 1 December 2015 at 21:49, Bodhisattwa Mandal
<bodhisattwa.rgkmc(a)gmail.com> wrote:
Hi Alex,
Of course, building a free OCR tool is the only permanent
solution, but WMF is not interested in building new OCR software
right now. The language engineering team said at the
conference that they don't have the infrastructure and
expertise to build such software. That's why we have to
rely on Google OCR, knowing very well about its profit-making
intentions. It's just a temporary solution, but
right now it's the best available alternative for us.
Regards
Bodhisattwa
On 1 Dec 2015 21:12, "Alex Brollo" <alex.brollo(a)gmail.com> wrote:
... nevertheless I found this about "SaaSS" very
interesting:
https://www.gnu.org/philosophy/who-does-that-server-really-serve.html
So, building a true, excellent and independent
"wikisource multilingual OCR service" would be a
better solution.
Alex
2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal
<bodhisattwa.rgkmc(a)gmail.com>:
Hi Nemo,
Thanks for your interest. You can find the list of languages
supported by Google OCR at the following link:
https://support.google.com/drive/answer/176692?hl=en
Regards,
Bodhisattwa
Thanks for posting about the topic. Which Indic
languages are we talking about exactly? Are they
included in the recent FineReader versions now
used by Internet Archive?
Nemo
_______________________________________________
Wikisource-l mailing list
Wikisource-l(a)lists.wikimedia.org
<mailto:Wikisource-l@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
--
Bodhisattwa Mandal
Administrator, Bengali Wikipedia
''Imagine a world in which every single person on the planet is given
free access to the sum of all human knowledge.''