Hello, I have access to a huge collection of old books in Persian (some of them are even typed), and almost all of them can be imported to Wikisource. The problem is that I don't have (or know of) any OCR for Persian. Do you know which OCR software supports Persian texts? (Supporting Arabic is not enough; I checked several programs.)
Best
Amir Ladsgroup, 24/06/2014 15:37:
I have access to a huge collection of old books in Persian (some of them are even typed), and almost all of them can be imported to Wikisource. The problem is that I don't have (or know of) any OCR for Persian. Do you know which OCR software supports Persian texts? (Supporting Arabic is not enough; I checked several programs.)
The only result for "Persian" and OCR on the ABBYY website is http://www.abbyy.com/CaseStudies/SISU-Reveals-Its-Multilingual-Content-to-Academic-Community-Thanks-to-ABBYY-Recognition-Server/, weird! It's worth asking them for details; they might have some additional plugins.
On the FLOSS side, maybe some library in Iran has invested in Tesseract? If there's a big digital library of Persian content, you should ask them as well.
Reminder: archive.org is still in need of people willing to compare 8.0 vs. 9.0 OCR results of some books in their language. :) http://thread.gmane.org/gmane.org.wikimedia.wikisource/1552
Nemo
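For anyone taking up the comparison request above, here is a minimal sketch of how two OCR outputs of the same page could be compared. It uses only Python's standard `difflib`; the function names are hypothetical, and the similarity ratio is only a rough indicator, not the method archive.org itself uses.

```python
import difflib


def ocr_similarity(text_a: str, text_b: str) -> float:
    """Return a 0..1 similarity ratio between two OCR outputs of one page."""
    # Collapse whitespace so that line-wrapping differences are not counted.
    norm_a = " ".join(text_a.split())
    norm_b = " ".join(text_b.split())
    matcher = difflib.SequenceMatcher(None, norm_a, norm_b, autojunk=False)
    return matcher.ratio()


def differing_words(text_a: str, text_b: str) -> list:
    """List (words_in_a, words_in_b) pairs for every run of words that differs."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Listing the differing word runs is usually more useful than the ratio alone, because a human still has to decide which of the two readings is the correct one.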
ABBYY FineReader has supported Hebrew and Arabic since v. 11, but I'm afraid that sharing the same script is not enough. For example, FineReader has three versions for Armenian. All three use the same script with different orthography and slightly different vocabulary, but if you set the wrong language, the drop in quality is dramatic. So I'm not sure whether Arabic OCR would work well for text in Farsi (Persian). FineReader provides a 30-day full trial, and I think it's worth giving it a try.
You could approach ABBYY and check whether there are any plans for full Persian support in the near future.
And training Tesseract seems like a good way to get a free/open source OCR for Persian, if you can get enough resources for that. But I can't comment on how well it will work with RTL scripts, especially with Nastaliq/Naskh, where letters and words are not separated from each other.
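One concrete reason why sharing a script may not be enough: Persian uses four letters that the basic Arabic alphabet lacks (peh, tcheh, jeh, gaf), and Persian text is normally encoded with its own code points for kaf and yeh. The sketch below counts those characters in a text sample. The function and constant names are hypothetical, and this says nothing about how FineReader works internally; it only shows how much of a text an Arabic-only character set would miss.

```python
import unicodedata
from collections import Counter

# Letters used in Persian that are absent from the basic Arabic alphabet.
PERSIAN_EXTRA_LETTERS = "\u067e\u0686\u0698\u06af"  # peh, tcheh, jeh, gaf
# Code points Persian text normally uses where Arabic uses U+0643 and U+064A.
PERSIAN_VARIANT_LETTERS = "\u06a9\u06cc"  # keheh, Farsi yeh


def persian_specific_letters(text: str) -> Counter:
    """Count the characters in text that are specific to Persian."""
    wanted = set(PERSIAN_EXTRA_LETTERS + PERSIAN_VARIANT_LETTERS)
    return Counter(ch for ch in text if ch in wanted)


def describe(counts: Counter) -> list:
    """Format the counts as 'U+XXXX UNICODE NAME: n' lines, sorted by code point."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: {n}"
        for ch, n in sorted(counts.items())
    ]
```

Running `describe(persian_specific_letters(sample))` on a page of ground-truth text gives a quick idea of how often an Arabic-only recognizer would have to guess.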
On Tue, Jun 24, 2014 at 6:13 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
The only result for "Persian" and OCR on the ABBYY website is http://www.abbyy.com/CaseStudies/SISU-Reveals-Its-Multilingual-Content-to-Academic-Community-Thanks-to-ABBYY-Recognition-Server/, weird! It's worth asking them for details; they might have some additional plugins.
On the FLOSS side, maybe some library in Iran has invested in Tesseract? If there's a big digital library of Persian content, you should ask them as well.
Reminder: archive.org is still in need of people willing to compare 8.0 vs. 9.0 OCR results of some books in their language. :) http://thread.gmane.org/gmane.org.wikimedia.wikisource/1552
Nemo
Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l
I tried ABBYY before and the quality was low. I will try Tesseract and see what happens.
Best
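A minimal sketch of what trying Tesseract on one scanned page could look like, calling the `tesseract` command-line tool from Python. It assumes that the executable is on the PATH and that Persian trained data is installed under the language code `fas`; whether that data is available for a given Tesseract version is an assumption to check with `tesseract --list-langs`. The function names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="fas"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="fas"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error,
    # for example when the requested language data is not installed.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract writes its plain-text result to <outputbase>.txt.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For example, `run_ocr("page_001.png", "page_001")` would leave the text in `page_001.txt`. Comparing that against a hand-typed page is the quickest way to judge whether the quality is usable.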
On Tue, Jun 24, 2014 at 7:08 PM, Aleksey Chalabyan xelgen.am@gmail.com wrote:
ABBYY FineReader has supported Hebrew and Arabic since v. 11, but I'm afraid that sharing the same script is not enough. For example, FineReader has three versions for Armenian. All three use the same script with different orthography and slightly different vocabulary, but if you set the wrong language, the drop in quality is dramatic. So I'm not sure whether Arabic OCR would work well for text in Farsi (Persian). FineReader provides a 30-day full trial, and I think it's worth giving it a try.
You could approach ABBYY and check whether there are any plans for full Persian support in the near future.
And training Tesseract seems like a good way to get a free/open source OCR for Persian, if you can get enough resources for that. But I can't comment on how well it will work with RTL scripts, especially with Nastaliq/Naskh, where letters and words are not separated from each other.
Hi,
I have ABBYY FineReader 11 Professional Edition, and Persian/Farsi is not among the supported languages. :(
Yann
2014-06-24 19:07 GMT+05:30 Amir Ladsgroup ladsgroup@gmail.com:
Hello, I have access to a huge collection of old books in Persian (some of them are even typed), and almost all of them can be imported to Wikisource. The problem is that I don't have (or know of) any OCR for Persian. Do you know which OCR software supports Persian texts? (Supporting Arabic is not enough; I checked several programs.)
Best
-- Amir
Is there any Persian/Parsi text in the Internet Archive? I'd like to take a look at its OCR, just to see whether the OCR engine attempts to interpret it (even if with no usable result).
Alex
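One way to look for such texts is the Internet Archive's advanced search endpoint. The sketch below only builds the query URL and does not send it. The function name is hypothetical, and the value stored in the `language` field for Persian items is an assumption (items may be tagged "Persian", "per", "fas", or something else), so the query may need adjusting.

```python
from urllib.parse import urlencode

ADVANCED_SEARCH = "https://archive.org/advancedsearch.php"


def build_search_url(language="Persian", rows=10):
    """Build an advanced-search URL for text items tagged with a language."""
    params = [
        # Assumption: items are tagged with this exact language value.
        ("q", f"language:{language} AND mediatype:texts"),
        ("fl[]", "identifier"),  # fields to return for each hit
        ("fl[]", "title"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return f"{ADVANCED_SEARCH}?{urlencode(params)}"
```

Opening the resulting URL in a browser returns a JSON list of identifiers; each item's page then shows whichever derived text files exist for it.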
2014-06-27 9:14 GMT+02:00 Yann Forget yannfo@gmail.com:
Hi,
I have ABBYY FineReader 11 Professional Edition, and Persian/Farsi is not among the supported languages. :(
Yann