Tesseract Open Source OCR Engine - Wikisource-l

21 Apr 2016

Hi,

I don't know where things stand with OCR for non-Latin scripts, so maybe 
this is not relevant anymore. Last time I looked into it, the Google 
service had limitations that were a problem, notably for Indic 
languages. Well, yesterday we had a contribution day around 
Alsatian and Franconian dialects 
<https://fr.wikipedia.org/wiki/Discussion_Projet:Alsace#Journ.C3.A9e_contributive_alsacien.2Ffrancique_20_avril_2016>

where I had the opportunity to talk with some linguists. One of them 
told me that Google is in fact using Tesseract 
<https://github.com/tesseract-ocr>, which is open source, for its OCR 
service. According to what she told me (or at least what I remember of 
it), it relies on a trainable engine that is not tied to one script: you 
define the matching between image samples and characters, and off it 
goes. Looking quickly at the langdata repository, I see that there is 
material for Devanagari, which I believe is the script used for at least 
some Indic texts, isn't it?
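
To make the "matching between image samples and characters" a bit more concrete: as far as I understand, Tesseract's training tools record it in plain-text "box files", where each line holds a character followed by the left, bottom, right and top pixel coordinates of its bounding box and a page number. Below is a minimal Python sketch of reading such lines, assuming that layout. The names `Box` and `parse_box_lines` and the sample coordinates are made up for illustration and are not part of Tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One character sample: the symbol and where it sits in the image."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file style text (assumed layout: symbol + 5 integers per line)."""
    boxes = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        # Split from the right so the five numeric fields are always last.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            raise ValueError(f"unexpected box line: {line!r}")
        left, bottom, right, top, page = (int(p) for p in parts[1:])
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes


# Hypothetical sample: two Devanagari characters on page 0.
sample = "क 10 20 40 60 0\nम 45 20 70 60 0\n"
boxes = parse_box_lines(sample)
for box in boxes:
    print(box.char, box.right - box.left, box.top - box.bottom)
```

Each parsed entry pairs one character with the image region that shows it, which is the kind of pairing the training step needs as input.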

Hope that may help,
mathieu