Re: [Wikimediaindia-l] Indic print material digitization workshop query

19 Aug 2013

Hi Everyone,

     In my opinion, it is always better to OCR  the documents. I agree that
it's error prone but there is a
Google Summer of Code project being done by AnkurIndia whose aim is to
improve the quality of OCRs
for Indian scripts.
https://www.google-melange.com/gsoc/project/google/gsoc2013/knoxxs/5001

So, maybe not immediately but in short time, OCR is worth it. I am not
aware if any Wikisource in Indian
languages is as vast as French, English or Italian Wikisource. But we
should have it because we have quite
a lot of text.

Thank You,
Aarti

On Mon, Aug 19, 2013 at 10:28 PM, Ashwin Baindur
&lt;ashwin.baindur(a)gmail.com&gt;wrote;wrote:

...
  Whether to OCR or not to OCR is a significant issue!
When we OCR a page of
 text, the resultant is often error-prone, lost formatting, and the
 correction requires crowd-sourced correction. Many of us know about Project
 Gutenberg. The site provides plain vanilla etexts. But what most people do
 not know that one of the very first crowd-sourcing initiatives -
 "Distributed Proof-readers" provides a huge volunteer community correcting
 OCR pages of text submitted to Project Gutenberg. In fact, I was a
 Distributed Proofreader before coming to Wikipedia and that was my first
 crowd-sourced experience.

 http://www.pgdp.net/c/

 I've also done digitisation in a government archive for five years. We
 took a conscious decision to OCR the text and allow the uncorrected layer
 to exist rather than take the pains to correct it. The material was used so
 infrequently, it made good sense for the end-user to proof-read himself
 should he desire to do so. So the real challenge in digitisation is not
 OCR, or rather, not just OCR but the creation of an error-free proof-read
 text layer behind the pdf/other formatted archive document.

 Ashwin Baindur

 On Mon, Aug 19, 2013 at 10:12 PM, Sumana Harihareswara <
 sumanah(a)wikimedia.org&gt; wrote:

  On 08/19/2013 02:52 AM, L. Shyamal wrote:
  Re-posting a now outdated query from meta

http://meta.wikimedia.org/wiki/Talk:India_Access_To_Knowledge/Events/Bangal…

 now that the workshop has already been conducted I think those that have
 attended the workshop could comment if this cover Indic language  OCR-ing -
  if it did it would be worthwhile if the OCR
software used can be  documented
  on the meta pages or elsewhere such as
Wikisource. Most of the more
 experienced editors here will be fairly familiar with the use of  scanners
  for creating PDF documents and uploading them to
places like the  Internet
  Archive but the experience or knowledge of OCRs
and their success rates  is
  a bit wanting for Indic languages (fonts).

 best wishes
 Shyamal
 en:User:Shyamal 
 I looked at the talk page on Meta - thank you, Shyamal!

 For those who do not know: OCR means Optical Character Recognition.
 When we want to get archival documents onto the web, it's nice to have
 photos of them, but it's even better to OCR them so that people can
 clearly read, copy, excerpt, translate, and remix the text.

 Is there a central list of the problems that OCR software (especially
 open source OCR software) has with text written in Indic languages?  If
 so, I could help encourage people to fix those problems, as volunteers,
 via a Google Summer of Code/Outreach Program for Women internship, via a
 grant-funded project (such as https://meta.wikimedia.org/wiki/Grants:IEG
 ), or via some other method.

 People who would like to make Wikisource more easily useful for Indic
 languages might want to contribute to the Wikisource vision development
 project that's going on right now:

 https://wikisource.org/wiki/Wikisource_vision_development

 The ProofreadPage extension (part of the Wikisource technology stack) is
 being worked on right now in Aarti K. Dwivedi's Google Summer of Code
 internship.  http://aartindi.blogspot.in/  She might be interested in
 knowing about these issues, so I am cc'ing her.

 Also - just because people on this list might be interested! - if you
 have an old historical map that you'd like to vectorize to get it onto
 OpenStreetMap, try out the new "Map polygon and feature extractor" tool:
 https://github.com/NYPL/map-vectorizer

 --
 Sumana Harihareswara
 Engineering Community Manager
 Wikimedia Foundation

 _______________________________________________
 Wikimediaindia-l mailing list
 Wikimediaindia-l(a)lists.wikimedia.org
 To unsubscribe from the list / change mailing preferences visit
 https://lists.wikimedia.org/mailman/listinfo/wikimediaindia-l

 --
 Warm regards,

 Ashwin Baindur
 ------------------------------------------------------

 _______________________________________________
 Wikimediaindia-l mailing list
 Wikimediaindia-l(a)lists.wikimedia.org
 To unsubscribe from the list / change mailing preferences visit
 https://lists.wikimedia.org/mailman/listinfo/wikimediaindia-l

-- 
Aarti K. Dwivedi

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

Re: [Wikimediaindia-l] Indic print material digitization workshop query