I think it is important for non-technical readers of this list to separate the 2 issues in the discussion.

1) OCR-Integration
This is something WMF can help with, because they can make the connection between an OCR service and MediaWiki easier and automate certain steps.

2) OCR
WMF is not developing OCR software, and it would probably be a bad idea to reinvent the wheel. It would be far better if editors reached out to existing OCR software projects. Starting a discussion or filing a bug is an important first step in improving the situation.
Tesseract-OCR (https://github.com/tesseract-ocr), for example, is an open-source project that works on OCR (no bugs have been filed for e.g. Bengali). The mailing list (https://groups.google.com/forum/#!forum/tesseract-ocr) contains discussions about e.g. Bengali (https://groups.google.com/forum/#!searchin/tesseract-ocr/Bengali). So I think the situation might not be good, but it is certainly on its way to getting better.

Maybe WMF-India can fund a developer to work on Tesseract-OCR. Another idea would be to reach out to local universities. Maybe a few informatics students can improve the situation.

-Tobias

2015-12-01 19:51 GMT+01:00 ViswaPrabha (വിശ്വപ്രഭ) <viswaprabha@gmail.com>:

Hundreds among us have burnt their hands trying to develop a successful 'free' OCR tool for Indic languages, without any real luck until now.

From the page that Alex has linked:

"On the other hand, using the service for converting document formats is SaaSS, because it's something you could have done by running a suitable program (free, one hopes) in your own computer."

Until such a tool appears on the horizon, the Google facility is just okay to be used. Especially so, because we are anyway dealing with 'free' input and output material.

-Viswaprabha

On 1 December 2015 at 21:49, Bodhisattwa Mandal <bodhisattwa.rgkmc@gmail.com> wrote:

Hi Alex,
Of course, building a free OCR can be the only permanent solution, but WMF is not interested in building new OCR software right now. The language engineering team said at the conference that they don't have the infrastructure and expertise to build such software. That's why we have to rely on Google OCR, knowing very well about its profit-making intentions. It's just a temporary solution, but right now it's the best available alternative for us.
Regards
Bodhisattwa

On 1 Dec 2015 21:12, "Alex Brollo" <alex.brollo@gmail.com> wrote:

... nevertheless I found this about "SaaSS" very interesting: https://www.gnu.org/philosophy/who-does-that-server-really-serve.html

So, building a true, excellent and independent "wikisource multilingual OCR service" would be a better solution.

Alex

2015-12-01 16:06 GMT+01:00 Bodhisattwa Mandal <bodhisattwa.rgkmc@gmail.com>:

Hi Nemo,
Thanks for your interest. You can find the list of languages supported by Google OCR at the following link:
https://support.google.com/drive/answer/176692?hl=en
Regards,
Bodhisattwa

Thanks for posting about the topic. Which Indic languages are we talking about exactly? Are they included in the recent FineReader versions now used by Internet Archive?
Nemo
_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l