Hi,
For a long time, Indic-language Wikisource projects depended entirely on manual proofreading, which cost a great deal of time and energy. Recently, Google released OCR software for more than 20 Indic languages, along with other Asian languages. This software is far more accurate than previous OCR tools, but it has limitations. Uploading the same large file twice (once to Google OCR and once to Commons) is not practical for most contributors, as Internet connections in India are very slow. If we develop a tool that feeds PDF or DjVu files already uploaded to Commons directly to Google OCR, the second upload can be avoided.
This was proposed in the 2015 Community Wishlist. Now that voting on the wishlist has started, the proposal needs your support. Please follow this link:
https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Wikisource#To...
FYI, this proposal was also accepted as a highest priority need at the 2015 Wikisource Conference in Vienna. (https://etherpad.wikimedia.org/p/wscon2015needs)
Regards
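To make the idea in the message above concrete, here is a minimal sketch of the first half of such a tool: asking Commons where an already-uploaded file lives, so that it can be fetched server-side instead of being uploaded a second time. It uses the standard MediaWiki API `imageinfo` query; the function names and the example file title are assumptions for illustration, and the hand-off to an OCR service is deliberately left out because the thread does not specify it.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def commons_file_query_url(file_title: str) -> str:
    """Build a MediaWiki API URL asking Commons for a file's direct URL.

    `file_title` is the title without the "File:" prefix (hypothetical example:
    "Example book.pdf").
    """
    params = {
        "action": "query",
        "titles": f"File:{file_title}",
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def extract_file_url(api_response: dict):
    """Return the first file URL found in a parsed imageinfo response, or None."""
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            if "url" in info:
                return info["url"]
    return None
```

A tool running on a server with a fast connection could request the URL built by `commons_file_query_url`, parse the JSON reply, and pass the result of `extract_file_url` to whatever download-and-OCR step is chosen.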
Thanks for posting about the topic. Which Indic languages are we talking about exactly? Are they included in the recent FineReader versions now used by Internet Archive?
Nemo
Hi Nemo,
Thanks for your interest. You can find the list of Google OCR supported languages in the following link -
https://support.google.com/drive/answer/176692?hl=en
Regards, Bodhisattwa
_______________________________________________ Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l
Bodhisattwa Mandal, 01/12/2015 16:06:
Thanks for your interest. You can find the list of Google OCR supported languages in the following link -
Yes, but that's very generic; for instance, they don't say what level of support they have. Most importantly, I was asking which languages are a) Indic, b) interesting for you, AND c) not supported by Internet Archive (FineReader).
Nemo
... nevertheless, I found this about "SaaSS" very interesting: https://www.gnu.org/philosophy/who-does-that-server-really-serve.html
So, building a true, excellent and independent "Wikisource multilingual OCR service" would be a better solution.
Alex
Hi Alex,
Of course, building a free OCR is the only permanent solution, but WMF is not interested in building a new OCR right now. The language engineering team said at the conference that they don't have the infrastructure and expertise to build such software. That's why we have to rely on Google OCR, knowing very well about its profit-making intentions. It's just a temporary solution, but right now it's the best available alternative for us.
Regards, Bodhisattwa
From the page that Alex linked:
"On the other hand, using the service for converting document formats *is* SaaSS, because it's something you could have done by running a suitable program (free, one hopes) in your own computer."
Hundreds among us have burnt their hands trying to develop a successful 'free' OCR tool for Indic languages, without any real luck until now. Until such a tool appears on the horizon, it is fine to use the Google facility.
Especially so, because we are anyway dealing with 'free' input and output material.
-Viswaprabha
I think it is important for non-technical readers of this list to separate the 2 issues in the discussion.
1) OCR-Integration
This is something WMF can help with, because they can make the connection between an OCR service and MediaWiki easier and automate certain steps.
2) OCR
WMF is not programming OCR software, and it would probably be a bad idea to reinvent the wheel. It would be far better if editors reached out to existing OCR software projects. Starting a discussion or filing a bug is an important first step in improving the situation. Tesseract-OCR (https://github.com/tesseract-ocr), for example, is an open-source project that works on OCR (no bugs have been filed for Bengali, for example). The mailing list (https://groups.google.com/forum/#!forum/tesseract-ocr) contains discussions about Bengali, among others (https://groups.google.com/forum/#!searchin/tesseract-ocr/Bengali). So I think the situation might not be good, but it is certainly on its way to getting better. Maybe WMF-India can fund a developer to work on Tesseract-OCR. Another idea would be to reach out to local universities. Maybe a few informatics students can improve the situation.
-Tobias
Hi,
I am happy to report that Shrinivasan has created a Python script to automate the process on Linux systems. The script uploads the PDF files to Google Drive, downloads the OCRed text, and splits and merges the text files so that they match the PDF file. We have just tested the script on small files for Kannada and Bengali Wikisource, and it was successful. We are going to test the script with different types and sizes of files, and in other Indic languages, in the next few days.
The script is at https://github.com/tshrinivasan/OCR4wikisource
Regards, Bodhisattwa
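For readers curious about the "merge" step mentioned in the message above, here is a minimal sketch of how per-page OCR text could be stitched back together in page order. It is not taken from OCR4wikisource; the filename pattern (`page_0001.txt`), the page separator, and the function names are assumptions made for illustration only.

```python
import re
from typing import Mapping


def page_number(filename: str) -> int:
    """Extract the first run of digits in a filename as the page number."""
    match = re.search(r"(\d+)", filename)
    if match is None:
        raise ValueError(f"no page number found in {filename!r}")
    return int(match.group(1))


def merge_pages(page_texts: Mapping[str, str]) -> str:
    """Join per-page OCR text into one document, ordered by page number.

    `page_texts` maps a (hypothetical) per-page filename to its OCRed text.
    Sorting numerically avoids the classic problem of "page_10" sorting
    before "page_2" in plain alphabetical order.
    """
    ordered = sorted(page_texts, key=page_number)
    return "\n\n".join(
        f"--- page {page_number(name)} ---\n{page_texts[name].strip()}"
        for name in ordered
    )
```

Keeping an explicit page marker in the merged text makes it easier to paste each page's text into the matching page of the proofreading index afterwards.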
Great, thank you for the news, and congratulations on this achievement. :)
Fantastic news!
A.
Yeah! I'm really happy that the BUB tool is being revived, and about the new OCR script. Thanks everyone!
Aubrey
Hi,
The OCR4Wikisource script is evolving rapidly. More than 1,50,000 (150,000) pages have already been OCRed in both Tamil and Bengali Wikisource using the OCR4Wikisource script. The idea and the tool have proved to be a game-changer for Indic Wikisource projects.
And when we were getting some hope, Google announced that they will charge for doing OCR using their drive. https://cloud.google.com/vision/
Is there any chance that WMF will go for negotiation with Google so that we can do the mass OCR free of charge? I remember Asaf once said that this possibility could be pursued. I think now is the time to do that.
Regards,
Bodhisattwa Mandal, 19/02/2016 18:02:
And when we were getting some hope, Google announced that they will charge for doing OCR using their drive. https://cloud.google.com/vision/
Makes sense.
Is there any chance that WMF will go for negotiation with Google so that we can do the mass OCR free of charge?
What makes you think that Google's goals may match with ours?
Nemo
On 20/02/2016 13:07, Federico Leva (Nemo) wrote:
What makes you think that Google's goals may match with ours?
Just asking will produce more certainty than speculating on matching agendas. :)
Hi,
Of course, I am aware that Google's goals do not match ours. But I am asking about the possibility of negotiation in this matter because we don't have other options but to use the Google OCR tool. If we had other, better OCR options, I would not raise the issue.
By the way, we are not using the Cloud Vision API for the script now, so we are still doing it without paying any money, but this shows that in the near future we may have to pay them. I am just being cautious in advance.
There may or may not be any negotiation; either way, we will use Google OCR as fully as we can. We will find other ways to do it.
Regards,
Bodhisattwa Mandal, 21/02/2016 17:13:
we don't have other options but to use the Google OCR tool.
This is not true, of course. There is always an alternative, the question is which alternative is worth pursuing.
Nemo
On Sun, Feb 21, 2016 at 9:01 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
This is not true, of course. There is always an alternative, the question is which alternative is worth pursuing.
Nemo, please be reasonable. If members of Indic communities say that there is an issue, and they have known that for years and tried to cope with it in different ways, and now they have found that Google OCR is finally available and working, we (who don't know the problem first-hand) should just shut up and listen. As a community, we should presume not just good faith, but sometimes also *the best knowledge*, in things we don't even have the basic literacy to understand.
I personally don't find any ethical issue in the WMF talking to Google about this: *language equity* (meaning a fundamental equality between languages) is a value per se, a value we should treasure as an international community.
So, statements like yours are probably made in good faith and spirit (I presume that because I know you and your rare, precious dedication to Wikimedia) but are not easy for others to understand. In the end, they are not helpful and feel harsh and negative. So please abstain or try to elaborate your point in a constructive, empathetic way. Thanks.
Aubrey
2016-02-21 21:21 GMT+01:00 Andrea Zanni zanni.andrea84@gmail.com:
Nemo, please be reasonable.
+1. We need to be realistic.
Right now, and as far as I know, the only known alternatives are:
- do nothing (it is indeed an alternative, but a very bad one),
- create a new OCR from scratch (probably the best option in the long run, but something that will take years at least and a huge amount of resources that nobody has; not even FineReader, a big professional company that has existed for 27 years and has more than 2000 employees).
Maybe there are other alternatives, but no one has pointed to even the beginning of an option.
Cdlt, ~nicolas
Nicolas VIGNERON, 21/02/2016 21:39:
Maybe there are other alternatives, but no one has pointed to even the beginning of an option.
It would be easier to point to options if we had answers to the basic question "for which languages exactly do we need another OCR?". https://lists.wikimedia.org/pipermail/wikisource-l/2016-February/002712.html
If people on this list are unable/unwilling to answer, can someone suggest where else/how to get/build an answer?
Nemo
2016-02-21 22:58 GMT+01:00 Federico Leva (Nemo) nemowiki@gmail.com:
It would be easier to point to options if we had answers to the basic question "for which languages exactly do we need another OCR?".
https://lists.wikimedia.org/pipermail/wikisource-l/2016-February/002712.html
If people on this list are unable/unwilling to answer, can someone suggest where else/how to get/build an answer?
It seems to me that the answer has already been given: ideally, all the Wikisources need an OCR, and in particular the Indic languages have no free OCR (AFAIK); Bodhisattwa pointed to http://wiki.wikimedia.in/List_of_Indian_language_wiki_projects last December on the community wishlist, and you gave the other half of the answer yesterday.
Meanwhile, I agree with your very last part: we should put this somewhere public (on oldwikisource? on Meta?) to have a broader view and gather more insight on the subject (not only Indic).
Cdlt, ~nicolas
Hi,
Another alternative is to develop a good-quality open-source OCR or to improve existing ones. Many in India and Bangladesh tried to create OCR with government funds, but no one knows what happened to those projects. We approached some of them, but either they were reluctant to show the results or they did not bother. WMF and WMIN were also approached to develop the OCR, but we were told that they have no such infrastructure or expertise to run the project.
Besides, developing a new OCR will take a lot of time, and we can't postpone our Wikisource projects for it. We have already waited a long time for a good-quality OCR. A few months ago, we were typing every page of a novel word by word, and that was our only way of proofreading. :-) But that's in the past now.
We always hope to get better alternatives, and if we find any, we will definitely pursue them.
Regards,
Apologies for not replying earlier.
I have managed to get the attention of WMF staff and they have pushed this to the right section within WMF to talk to Google.
I suggest that we give them a week to get their heads around the issues and to ask questions.
This falls into the important, though not screamingly urgent, category.
We should have a Phabricator ticket for this, so we can track it better.
--Billinghurst
On Mon, 22 Feb 2016 03:13 Bodhisattwa Mandal bodhisattwa.rgkmc@gmail.com wrote:
Hi,
Of course, I am aware that Google's goals do not match ours. But I am talking about the possibility of negotiation in this matter, because we don't have other options but to use the Google OCR tool. If we had other, better OCR options, I would not raise the issue.
By the way, we are not using the Cloud Vision API for the script now, so we are still doing it without paying any money, but this shows that we may have to pay them in the near future. I am just being cautious in advance.
There may or may not be any negotiation; either way, we will utilise the Google OCR as fully as we can. We will find other ways to do it.
Regards, On Feb 21, 2016 9:04 PM, "Mathieu Stumpf Guntz" < psychoslave@culture-libre.org> wrote:
On 20/02/2016 13:07, Federico Leva (Nemo) wrote:
Bodhisattwa Mandal, 19/02/2016 18:02:
And when we were getting some hope, Google announced that they will charge for doing OCR using their drive. https://cloud.google.com/vision/
Makes sense.
Is there any chance that WMF will go for negotiation with Google so that we can do the mass OCR free of charge?
What makes you think that Google's goals may match with ours?
Just asking will produce more certainty than speculating on matching agendas. :)
Nemo
Hi Nemo,
Please follow this link also,
http://cis-india.org/a2k/blogs/googles-optical-character-recognition-softwar...
Regards, Bodhisattwa On 1 Dec 2015 20:29, "Federico Leva (Nemo)" nemowiki@gmail.com wrote:
Thanks for posting about the topic. Which indic languages are we talking about exactly? Are they included in the recent FineReader versions now used by Internet Archive?
Nemo
Hi Nemo,
1) Indic languages are basically all languages of the Indian subcontinent, like Hindi, Sanskrit, Urdu, Punjabi, Gujarati, Marathi, Tamil, Telugu, Kannada, Malayalam, Bengali, Odia, Assamese etc.
2) My specific interest is in the Bengali language.
3) I cannot speak for other Indic languages, but I can say that Bengali is not included in the FineReader version used by IA.
Regards, Bodhisattwa
Bodhisattwa Mandal, 01/12/2015 16:35:
3) I cannot tell about other Indic languages, but I can say that Bengali is not included in FineReader version of IA.
Ok, thanks for answering my question. What other languages are we interested in that are missing? See http://www.abbyy.com/support/finereader/11/rl/ for the list.
Nemo
Hi
Just wanted to introduce the bub tool on Tools Lab. It downloads books from Google Books and some other libraries and then uploads them to the Internet Archive for OCR (after that, Tpt's ia-upload tool can be used for the Commons upload). The tool was down for a long time, but it's getting ready again (a few more fixes are needed).
Hope it'll be useful for the community again.
- Rohit
Trying to answer myself with some cleaning. So, which existing or potential Wikisources lack OCR with FineReader (and Tesseract?) but are interested in Google's, beyond Bengali?
== Languages supported by FineReader, not Google ==
Abkhaz, Adyghe, Agul, Altaic, Avar, Blackfoot, Bugotu, Buryat, Chamorro, Chukchee, Corsican, Crow, Dargwa, Dungan, Dutch (Belgium), Eskimo (Cyrillic, Latin), Even, Evenki, Frisian, Friulian, Gagauz, German (Luxemburg), German (old spelling), Hani, Ido, Ingush, Interlingua, Jingpo, Kabardian, Kalmyk, Karachay-balkar, Kasub, Kawa, Khakass, Khanty, Korean (Hangul), Koryak, Kpelle, Kumyk, Kurdish, Lak, Lezgi, Luba, Malinke, Mansi, Mari, Maya, Miao, Moldavian, Mordvin, Nenets, Nivkh, Nogay, Norwegian (Nynorsk), Occidental, Ojibway, Ossetian, Provencal, Rhaeto-romanic, Rwanda, Sami (Lappish), Selkup, Somali, Sorbian, Sotho, Sunda, Tabasaran, Tagalog, Tok Pisin, Tun, Turkmen (Latin), Tuvinian, Udmurt, Uighur (Cyrillic, Latin), Ukrainian, Yakut
== Languages supported by Google, not FineReader ==
Acehnese, Acholi, Adangme, Akan, Algonquinian, Amharic, Ancient Greek, Araucanian/Mapuche, Assamese, Asturian, Athabaskan, Balinese, Bambara, Bantu, Batak, Bengali, Bikol, Bislama, Bosnian, Burmese, Cherokee, Chinese (Mandarin; Hong Kong), Choctaw, Cree, Creek, Dhivehi, Duala, Dzonkha, Efik, Ewe, Filipino, Fon, Fulah, Ga, Gayo, Georgian, Gilbertese, Gothic, Gujarati, Haitian Creole, Herero, Hiligaynon, Hindi, Iban, Igbo, Iloko, Javanese, Kabyle, Kachin, Kalaallisut, Kamba, Kannada, Kanuri, Khasi, Khmer, Kinyarwanda, Komi, Kosraean, Kuanyama, Lao, Lingala, Low German, Lozi, Luba-Katanga, Luo, Madurese, Malayalam, Mandingo, Manx, Marathi, Marshallese, Mende, Middle English, Middle High German, Mongo, Navajo, Ndonga, Nepali, Niuean, Northern Sotho, North Ndebele, Nyankole, Nyasa Tonga, Nzima, Ojibwa, Old English, Old French, Old High German, Old Norse, Old Provencal, Oriya, Ossetic, Pampanga, Pangasinan, Pashto, Persian, Punjabi (Gurmukhi), Romansh, Sakha, Sango, Sanskrit, Scots, Sinhala, Songhai, Southern Sotho, Sundanese, Tamil, Telugu, Temne, Tibetan, Tigirinya, Tsonga, Udmurt, Ukrainian, Urdu, Venda, Votic, Western Frisian, Yoruba
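For anyone who wants to redo this comparison, here is a minimal sketch of how two such lists could be derived as set differences. It assumes each product's supported languages are available as plain lists of names. The short sample lists and the helper name `only_in` are illustrative assumptions, not the full published lists or any existing tool. Names must also be spelled identically in both inputs for the comparison to be meaningful (e.g. "Ojibway" vs "Ojibwa" would not match).

```python
def only_in(first, second):
    """Return the sorted names present in `first` but absent from `second`."""
    return sorted(set(first) - set(second))


# Illustrative, abbreviated samples only; not the full vendor lists.
finereader_sample = ["English", "Abkhaz", "Tagalog"]
google_sample = ["English", "Bengali", "Tamil"]

# Languages supported by one product and not the other.
finereader_only = only_in(finereader_sample, google_sample)
google_only = only_in(google_sample, finereader_sample)

print("FineReader, not Google:", ", ".join(finereader_only))
print("Google, not FineReader:", ", ".join(google_only))
```

In practice the two inputs would need normalising first (alternative spellings, script variants in parentheses), which this sketch does not attempt.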