Re-posting a now outdated query from meta http://meta.wikimedia.org/wiki/Talk:India_Access_To_Knowledge/Events/Bangalo...
Now that the workshop has been conducted, I think those who attended could comment on whether it covered Indic-language OCR. If it did, it would be worthwhile to document the OCR software used on the meta pages or elsewhere, such as Wikisource. Most of the more experienced editors here will be fairly familiar with using scanners to create PDF documents and uploading them to places like the Internet Archive, but experience and knowledge of OCR tools and their success rates for Indic languages (and fonts) is a bit wanting.
best wishes, Shyamal (en:User:Shyamal)
On Mon, Aug 19, 2013 at 12:22 PM, L. Shyamal lshyamal@gmail.com wrote:
The phrase "creating text based documents", which forms the basis of the question, would require further explanation. OCR would include both digitization and information retrieval.
On 08/19/2013 02:52 AM, L. Shyamal wrote:
I looked at the talk page on Meta - thank you, Shyamal!
For those who do not know: OCR means Optical Character Recognition. When we want to get archival documents onto the web, it's nice to have photos of them, but it's even better to OCR them so that people can clearly read, copy, excerpt, translate, and remix the text.
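As a concrete illustration of that step, here is a minimal sketch, assuming the open source Tesseract engine and its Python wrapper (pytesseract) with an Indic language pack installed; the file name and the choice of Hindi data are placeholders, not anything tied to the workshop:

# A minimal OCR sketch: extract Unicode text from a scanned page.
# Assumes Tesseract + pytesseract and the Hindi ("hin") language
# data are installed. "page.png" is a hypothetical scan.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("page.png"), lang="hin")
print(text)  # recognised text that readers can copy, excerpt, and remix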
Is there a central list of the problems that OCR software (especially open source OCR software) has with text written in Indic languages? If so, I could help encourage people to fix those problems, as volunteers, via a Google Summer of Code/Outreach Program for Women internship, via a grant-funded project (such as https://meta.wikimedia.org/wiki/Grants:IEG ), or via some other method.
People who would like to make Wikisource easier to use for Indic languages might want to contribute to the Wikisource vision development project that's going on right now:
https://wikisource.org/wiki/Wikisource_vision_development
The ProofreadPage extension (part of the Wikisource technology stack) is being worked on right now in Aarti K. Dwivedi's Google Summer of Code internship. http://aartindi.blogspot.in/ She might be interested in knowing about these issues, so I am cc'ing her.
Also - just because people on this list might be interested! - if you have an old historical map that you'd like to vectorize to get it onto OpenStreetMap, try out the new "Map polygon and feature extractor" tool: https://github.com/NYPL/map-vectorizer
Whether to OCR or not to OCR is a significant issue! When we OCR a page of text, the result is often riddled with errors and lost formatting, and fixing it requires crowd-sourced correction. Many of us know about Project Gutenberg, which provides plain-vanilla etexts. But what most people do not know is that one of the very first crowd-sourcing initiatives, "Distributed Proofreaders", provides a huge volunteer community that corrects OCR'd pages of text submitted to Project Gutenberg. In fact, I was a Distributed Proofreader before coming to Wikipedia, and that was my first crowd-sourcing experience.
I've also done digitisation in a government archive for five years. We took a conscious decision to OCR the text and let the uncorrected layer stand rather than take the pains to correct it. The material was used so infrequently that it made good sense to let end-users proofread it themselves should they desire to do so. So the real challenge in digitisation is not OCR, or rather not just OCR, but the creation of an error-free, proofread text layer behind the PDF or other formatted archive document.
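As a rough sketch of that uncorrected-layer workflow (assuming the open source Tesseract engine and its pytesseract wrapper; the file names and the Bengali language data are placeholders, not what our archive actually used):

# Produce a searchable PDF: the scanned image on top, an uncorrected
# OCR text layer behind it. Assumes Tesseract + pytesseract with the
# Bengali ("ben") language data; file names are hypothetical.
import pytesseract

pdf_bytes = pytesseract.image_to_pdf_or_hocr("page.png", lang="ben",
                                             extension="pdf")
with open("page_searchable.pdf", "wb") as f:
    f.write(pdf_bytes)  # text is searchable/selectable; proofreading can come later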
Ashwin Baindur
Hi Everyone,
In my opinion, it is always better to OCR the documents. I agree that OCR is error-prone, but there is a Google Summer of Code project being done with AnkurIndia whose aim is to improve the quality of OCR for Indian scripts: https://www.google-melange.com/gsoc/project/google/gsoc2013/knoxxs/5001
So, maybe not immediately, but in a short time OCR will be worth it. I am not aware of any Wikisource in an Indian language that is as vast as the French, English, or Italian Wikisources, but we should have one, because we have quite a lot of text.
Thank You, Aarti
Colleagues working in Bangla say that in their experience it is faster, cheaper, and less error-prone to create digital texts by typing them in. Once there is a larger body of digitised texts, and OCR technology for Indian languages also improves, OCR could become the preferred option.
Tejaswini
On Wed, Aug 21, 2013 at 11:23 AM, Tejaswini Niranjana teju@cscs.res.in wrote:
> Colleagues working in Bangla say that in their experience it is faster, cheaper, and less error-prone to create digital texts by typing them in.
The "cheaper" is an interesting word to use in this context. Are we still arbitraging on the low-cost of human labor to input texts? It is faster/less error-prone because you are comparing against systems that are vaporware.
> Once there is a larger body of digitised texts, and OCR technology for Indian languages also improves, OCR could become the preferred option.
With the strong emphasis on "cheaper", I wonder if there will be enough demand for a machine-driven system.
I second Tejaswini. Those who are working on Kannada OCR development also say the same.
Regards,
Pavanaja
While saying "cheaper", are we considering the recurring cost of human labour (whose future cost is uncertain), or just taking into account the initial one-off cost of software development?
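To make the question concrete, a back-of-the-envelope sketch; every figure in it is an invented placeholder, purely to show how a one-off development cost amortises against recurring labour:

# Illustrative cost model only -- all figures below are invented
# placeholders, not actual costs from any project.
pages = 100_000                 # corpus size to digitise
typing_cost_per_page = 20.0     # recurring manual-typing labour
ocr_dev_cost = 1_500_000.0      # one-off OCR software development
proofread_cost_per_page = 8.0   # recurring post-OCR correction labour

typing_total = pages * typing_cost_per_page
ocr_total = ocr_dev_cost + pages * proofread_cost_per_page
print(f"typing: {typing_total:,.0f}   OCR + proofreading: {ocr_total:,.0f}")

# Beyond this corpus size the one-off cost is amortised and OCR wins:
break_even = ocr_dev_cost / (typing_cost_per_page - proofread_cost_per_page)
print(f"break-even: {break_even:,.0f} pages")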
Regards, Dhaval
There is no need for an either-or approach, I think. While the development of effective OCR for Indian languages should be encouraged (Sankarshan has given a comprehensive overview of the developments in this space earlier), for the immediate need typing is still an effective option.
I read "cheaper" in the context of the humongous amounts spent on OCR by various agencies, especially the GoI, over the last decade, while we are yet to see effective results. Putting together a list of all the OCR development efforts in Indian languages and the expenses incurred (a status paper?), especially by the GoI, could also be useful in thinking about future work on this.
Best, Vishnu
On Mon, Aug 19, 2013 at 10:12 PM, Sumana Harihareswara sumanah@wikimedia.org wrote:
> Is there a central list of the problems that OCR software (especially open source OCR software) has with text written in Indic languages? If so, I could help encourage people to fix those problems, as volunteers, via a Google Summer of Code/Outreach Program for Women internship, via a grant-funded project (such as https://meta.wikimedia.org/wiki/Grants:IEG ), or via some other method.
http://www.google-melange.com/gsoc/org/google/gsoc2013/ankur_india would show that two of the projects being undertaken in this iteration of GSoC pertain to OCR and IR (information retrieval). Additionally, those who want to keep themselves updated on progress in this space should make sure they are in touch with the group organizing http://www.isical.ac.in/~fire/
Over the past decade I've heard many esteemed research organizations in India talk about how they have OCR systems that are 80-88% accurate. At any large scale, that accuracy is effectively worthless. Add to this the fact that none of the code bases of those systems are in the public domain (even when the original research was done with public funds), which in turn negates any attempt to validate the claims of accuracy or to undertake iterative improvement.
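To put those figures in perspective, a quick worked example (the page size is my assumption; the accuracies are the ones quoted above):

# Character accuracy vs. wrong characters per page.
chars_per_page = 1500  # assumed average for a dense printed page
for accuracy in (0.80, 0.88, 0.99):
    errors = chars_per_page * (1 - accuracy)
    print(f"{accuracy:.0%} accurate -> ~{errors:.0f} wrong characters/page")
# 80% accurate -> ~300 wrong characters/page
# 88% accurate -> ~180 wrong characters/page
# 99% accurate -> ~15 wrong characters/page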
http://www.amazon.com/Guide-OCR-Indic-Scripts-Recognition/dp/1848003293 : Guide to OCR for Indic Scripts: Document Recognition and Retrieval (Advances in Computer Vision and Pattern Recognition) is a volume published in 2009, but it does a good job of summing up the problems in the OCR space pertaining to Indic scripts, as well as the (then) state of the art.
OCR and IR are very interesting to talk about (and great ideas to raise funds for!), but I've rarely seen a serious attempt to take the challenges head-on (barring Debayan's attempt with Tesseract).
/s
@Sumana Harihareswara: please take a look at the Bengali OCR project, https://code.google.com/p/banglaocr/ , and its need for further development.