Hi,
The OCR4Wikisource script is evolving rapidly. More than 150,000 pages
have already been OCRed on the Tamil and Bengali Wikisources using the
OCR4Wikisource script. The idea and the tool have proved to be a
game-changer for Indic Wikisource projects.
But just as we were becoming hopeful, Google announced that it will charge
for OCR done through Google Drive.
https://cloud.google.com/vision/
Is there any chance that WMF will negotiate with Google so that we can do
the mass OCR free of charge? I remember Asaf once said that this
possibility could be pursued. I think now is the time to do that.
Regards,
Yeah!
I'm really happy that the BUB tool is being revived, and about the new OCR
script. Thanks everyone!
Aubrey
On Tue, Jan 5, 2016 at 9:53 PM, Asaf Bartov <abartov(a)wikimedia.org> wrote:
On Tue, Jan 5, 2016 at 10:29 AM, Bodhisattwa Mandal <bodhisattwa.rgkmc(a)gmail.com> wrote:
Hi,
I am happy to report that Shrinivasan has created a Python script to
automate the process on Linux systems. The script uploads the PDF files to
Google Drive, downloads the OCRed text, and splits and merges the text
files so that they match the PDF file. We have just tested the script on
small files for Kannada and Bengali Wikisource, and it was successful. We
are going to test the script with different types and sizes of files, and
in other Indic languages, in the next few days.
The script is in
https://github.com/tshrinivasan/OCR4wikisource
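For readers who want a picture of the flow described above, here is a minimal Python sketch of the split and merge steps only. It is not taken from the OCR4wikisource repository. The names plan_chunks, ocr_book and ocr_chunk are hypothetical, and ocr_chunk is just a stand-in for the Google Drive upload/download round trip, which is not shown here.

```python
from typing import Callable, Dict, List, Tuple


def plan_chunks(total_pages: int, pages_per_chunk: int) -> List[Tuple[int, int]]:
    """Split a PDF of total_pages into inclusive, 1-based (first, last) page ranges."""
    if total_pages < 0 or pages_per_chunk < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk >= 1")
    return [
        (first, min(first + pages_per_chunk - 1, total_pages))
        for first in range(1, total_pages + 1, pages_per_chunk)
    ]


def ocr_book(
    total_pages: int,
    pages_per_chunk: int,
    ocr_chunk: Callable[[int, int], List[str]],
) -> Dict[int, str]:
    """Run ocr_chunk on every page range and merge the results by page number.

    ocr_chunk(first, last) is a hypothetical callable that must return one
    text string per page in the range; here it replaces the real OCR service.
    """
    pages: Dict[int, str] = {}
    for first, last in plan_chunks(total_pages, pages_per_chunk):
        texts = ocr_chunk(first, last)
        if len(texts) != last - first + 1:
            raise ValueError(f"expected {last - first + 1} pages for {first}-{last}")
        # Put each page's text back at its original position in the book.
        for offset, text in enumerate(texts):
            pages[first + offset] = text
    return pages


def fake_ocr(first: int, last: int) -> List[str]:
    """Placeholder OCR used only for illustration; returns dummy text per page."""
    return [f"text of page {n}" for n in range(first, last + 1)]
```

Calling ocr_book(5, 2, fake_ocr) with the placeholder above returns a dictionary keyed by page number 1 to 5, so each text can be matched to the corresponding page of the PDF.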
Fantastic news!
A.
_______________________________________________
Wikisource-l mailing list
Wikisource-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l