On Mon, Aug 19, 2013 at 10:12 PM, Sumana Harihareswara
<sumanah(a)wikimedia.org> wrote:
> Is there a central list of the problems that OCR software (especially
> open source OCR software) has with text written in Indic languages? If
> so, I could help encourage people to fix those problems, as volunteers,
> via a Google Summer of Code/Outreach Program for Women internship, via a
> grant-funded project (such as https://meta.wikimedia.org/wiki/Grants:IEG),
> or via some other method.

<http://www.google-melange.com/gsoc/org/google/gsoc2013/ankur_india>
would show that two of the projects that are being undertaken in this
iteration of GSoC pertain to OCR and IR (information retrieval).
Additionally, for those who want to keep themselves updated with the
progress in this space, please make sure that you are in touch with
the group organizing <http://www.isical.ac.in/~fire/>.

Over the past decade I've heard many esteemed research organizations
in India talk about how they have OCR systems which are 80-88%
accurate. At a large scale, that level of accuracy is practically
worthless. Add to this the fact that none of the code bases of those
systems are in the public domain (even if the original research was
done with public funds), which in turn rules out any attempt to
validate the claims of accuracy or to undertake iterative improvement.
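
To put rough numbers on why 80-88% does not hold up at scale, here is a back-of-the-envelope sketch. It assumes the quoted figures are per-character accuracy and that character errors are independent; neither is confirmed by the organizations making the claims. The word length of 5 and the 2000-character page are illustrative values, not measurements.

```python
def expected_word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word is recognized correctly.

    Assumes independent per-character errors (a simplifying assumption).
    """
    return char_accuracy ** word_length


def expected_char_errors(char_accuracy: float, total_chars: int) -> float:
    """Expected number of misrecognized characters in a text of total_chars."""
    return (1.0 - char_accuracy) * total_chars


# Illustrative values only: 5-character words, 2000 characters per page.
for acc in (0.80, 0.88):
    word_acc = expected_word_accuracy(acc, word_length=5)
    errors = expected_char_errors(acc, total_chars=2000)
    print(f"char accuracy {acc:.0%}: word accuracy about {word_acc:.0%}, "
          f"about {errors:.0f} character errors per page")
```

Under those assumptions, even the 88% figure leaves only about half of the words fully correct, which is too noisy for search or proofreading across a large corpus.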

<http://www.amazon.com/Guide-OCR-Indic-Scripts-Recognition/dp/1848003293>
Guide to OCR for Indic Scripts: Document Recognition and Retrieval
(Advances in Computer Vision and Pattern Recognition) is a volume
published in 2009, but it does a good job of summing up the problems in
the OCR space pertaining to Indic scripts, as well as the (then)
state of the art.

OCR and IR are very interesting to talk about (also, great ideas to
raise funds for!). I've rarely seen a serious attempt to take the
challenges head-on (barring Debayan's attempt with Tesseract).
/s
--
sankarshan mukhopadhyay
<https://twitter.com/#!/sankarshan>