Re: [Wikimediaindia-l] Indic print material digitization workshop query

20 Aug 2013


On Mon, Aug 19, 2013 at 10:12 PM, Sumana Harihareswara
<sumanah@wikimedia.org> wrote:
> ...
> Is there a central list of the problems that OCR software (especially
> open source OCR software) has with text written in Indic languages?  If
> so, I could help encourage people to fix those problems, as volunteers,
> via a Google Summer of Code/Outreach Program for Women internship, via a
> grant-funded project (such as https://meta.wikimedia.org/wiki/Grants:IEG
> ), or via some other method.

http://www.google-melange.com/gsoc/org/google/gsoc2013/ankur_india
shows that two of the projects being undertaken in this iteration of
GSoC pertain to OCR and IR (information retrieval). Additionally,
those who want to keep up with progress in this space should stay in
touch with the group organizing http://www.isical.ac.in/~fire/

Over the past decade I've heard many esteemed research organizations
in India talk about how they have OCR systems which are 80-88%
accurate. At a large scale, that accuracy is practically worthless.
Add to this the fact that none of the code bases of those systems are
in the public domain (even if the original research was done with
public funds), which in turn rules out any attempt to validate the
claims of accuracy or to undertake iterative improvement.

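To make the scale argument concrete, here is a minimal sketch. It assumes
the quoted 80-88% is per-character accuracy and that errors are independent;
neither is stated above. The 2,000-character page and 5-character word are
hypothetical figures chosen only for illustration.

```python
def expected_errors(chars: int, accuracy: float) -> float:
    """Expected number of misrecognized characters out of `chars`."""
    return chars * (1.0 - accuracy)


def clean_word_probability(word_length: int, accuracy: float) -> float:
    """Probability that every character of a word is recognized correctly.

    Assumes character errors are independent, which is a simplification.
    """
    return accuracy ** word_length


if __name__ == "__main__":
    page_chars = 2000   # hypothetical page size, in characters
    word_length = 5     # hypothetical average word length

    for accuracy in (0.80, 0.88):
        errors = expected_errors(page_chars, accuracy)
        clean = clean_word_probability(word_length, accuracy)
        print(
            f"accuracy {accuracy:.0%}: about {errors:.0f} wrong characters "
            f"per page, {clean:.0%} of words fully correct"
        )
```

Under these assumptions, even the upper end of the range leaves hundreds of
characters per page to be corrected by hand, which is why such output is of
little use across a large collection.
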
Guide to OCR for Indic Scripts: Document Recognition and Retrieval
(Advances in Computer Vision and Pattern Recognition),
http://www.amazon.com/Guide-OCR-Indic-Scripts-Recognition/dp/1848003293
, was published in 2009, but it does a good job of summing up the
problems in the OCR space pertaining to Indic scripts, as well as the
(then) state of the art.

OCR and IR are very interesting to talk about (also, great ideas to
raise funds for!). I've rarely seen a serious attempt to take the
challenges head on (barring Debayan's attempt with Tesseract).
/s
-- 
sankarshan mukhopadhyay
https://twitter.com/#!/sankarshan
