Dear Wikimedia India,
As you are probably aware, the Govt. of India, immediately after Independence, started multiple Indian-language encyclopedia projects to bring in Science and Technology. The Tamil language encyclopedia was completed [http://en.wikipedia.org/wiki/Tamil_Encyclopedia].
I'm pleased to report that Tamil Virtual University has scanned the Tamil Kalaikalanjiam / Tamil Encyclopedia [please see Reference 1 below].
I was able to download the material with the wonderful wget command and the 'convert' tool (from ImageMagick) on GNU/Linux. However, each of the 10 volumes is close to 700 MB without compression.
I would imagine the people behind this mammoth task (in the pre-internet era) would have liked it to be merged into a wiki-type format, which would make it a truly living document in sync with the times.
I do not have any experience with 1) Tamil OCR software or 2) automated updates to Wikipedia.
Can anyone take the lead on this project? It will help boost the number of quality articles in Indian languages. The Children's encyclopedia is being scanned and has a lot of great visual content.
I have uploaded a sample (10 MB) PDF file at https://sites.google.com/site/periasamythooran/kalaikalanjiam/kalaikalanjiam... if you are interested in giving it a spin.
Thanks,
Murali.
1. http://www.tamilvu.org/library/libindex.htm (click on Kalaikalanjiam / Tamil Encyclopedia).
Dear Murali,
Thanks for your efforts. Natkeeran, an active Tamil Wikipedian, has also downloaded the entire encyclopaedia. He is also stalled at the OCR stage. There are some Tamil OCR projects at the IISc MILE lab under Prof. A. G. Ramakrishnan and at some other government-supported labs. However, I have yet to see one that is publicly available and ready to use. That said, we're still working to get even an alpha version for this purpose.
I would encourage you to connect with Natkeeran (copied) to take this further.
- Sundar
"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted." - George Boole, quoted in Iverson's Turing Award Lecture
From: Murali Kumar pthooran@hotmail.com To: wikimediaindia-l@lists.wikimedia.org Sent: Mon, November 15, 2010 11:33:22 AM Subject: [Wikimediaindia-l] Tamil Encyclopedia merge into Wikipedia.
I have a query.
What is the license of Tamil Kalaikalanjiam? Has the Tamil Nadu government or Tamil Virtual University officially announced that this encyclopedia is released into the public domain or under a Creative Commons license, so that we can reuse the content? If yes, we can very well reuse the content. Otherwise it will be a copyright violation. So kindly verify this.
Let us not assume that since it is published by the Government it will be in the public domain. In India that is not the case.
In December 2008, the Kerala Government officially announced that it was changing the license of a similar encyclopedia project in Malayalam, Sarvavijnanakosam (http://en.wikipedia.org/wiki/Sarvavijnanakosam), to the GNU Free Documentation License (http://www.gnu.org/copyleft/fdl.html) so that the Malayalam wiki community can reuse its content to develop Malayalam Wikipedia. The Kerala Government has also set up its own wiki for Sarvavijnanakosam (to help us), and they are slowly digitizing the content and posting it there (http://mal.sarva.gov.in). They have completed some 2,900 articles now. We are reusing this content to enhance many of the existing articles. But we are not copy-pasting the entire content, for various reasons. The main reason is that the content needs to be rewritten in the style of Wikipedia.
I really have doubts about the effectiveness of current OCR software for Indian languages. It is still under development, and the existing solutions are not good. I am not sure about Tamil OCR software specifically.
Shiju Alex
Wikimediaindia-l mailing list Wikimediaindia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimediaindia-l
On Mon, Nov 15, 2010 at 5:32 PM, Shiju Alex shijualexonline@gmail.com wrote:
I have a query.
What is the license of Tamil Kalaikalanjiam? Has the Tamil Nadu government or Tamil Virtual University officially announced that this encyclopedia is released into the public domain or under a Creative Commons license, so that we can reuse the content? If yes, we can very well reuse the content. Otherwise it will be a copyright violation. So kindly verify this.
Let us not assume that since it is published by the Government it will be in the public domain. In India that is not the case.
If it is released under a free license, it can be transcribed on Tamil Wikisource.
http://ta.wikisource.org/wiki/
The software to do transcriptions on Wikisource is not enabled on Tamil Wikisource, but it can be enabled once these messages have been translated.
http://translatewiki.net/w/i.php?title=Special%3ATranslate&task=untransl...
Here is an example of the Wikisource transcription software with a completed work:
http://en.wikisource.org/wiki/Index:Sanskrit_Grammar_by_Whitney_p1.djvu
and an incomplete project:
http://en.wikisource.org/wiki/Index:Dictionary_of_National_Biography_volume_...
In December 2008, the Kerala Government officially announced that it was changing the license of a similar encyclopedia project in Malayalam (Sarvavijnanakosam) to the Free Documentation License so that the Malayalam wiki community can reuse its content to develop Malayalam Wikipedia. The Kerala Government has also set up its own wiki for Sarvavijnanakosam (to help us), and they are slowly digitizing the content and posting it there (http://mal.sarva.gov.in). They have completed some 2,900 articles now. We are reusing this content to enhance many of the existing articles. But we are not copy-pasting the entire content, for various reasons. The main reason is that the content needs to be rewritten in the style of Wikipedia.
The Wikisource transcription software is a MediaWiki extension, so it could be added to mal.sarva.gov.in.
-- John Vandenberg
Yes, we need to get it under a suitable license. If the technical issue related to OCR is resolved, we can talk to them about releasing the content into the public domain.
- Sundar
From: Shiju Alex shijualexonline@gmail.com To: wikimediaindia-l@lists.wikimedia.org Sent: Mon, November 15, 2010 12:02:42 PM Subject: Re: [Wikimediaindia-l] Tamil Encyclopedia merge into Wikipedia.