The dump site (http://download.wikimedia.org/) is still broken at the moment, but another way to build word frequency data is to randomly sample the Wikipedias for the languages you are interested in. At least these Indic languages have Wikipedias, of varying sizes:
Assamese http://as.wikipedia.org
Bihari http://bh.wikipedia.org
Bengali http://bn.wikipedia.org
Bishnupriya Manipuri http://bpy.wikipedia.org
Gujarati http://gu.wikipedia.org
Hindi http://hi.wikipedia.org
Kannada http://kn.wikipedia.org
Kashmiri http://ks.wikipedia.org
Marathi http://mr.wikipedia.org
Nepali http://ne.wikipedia.org
Nepal Bhasa http://new.wikipedia.org
Oriya http://or.wikipedia.org/wiki
Eastern Punjabi http://pa.wikipedia.org
Western Punjabi http://pnb.wikipedia.org
Sanskrit http://sa.wikipedia.org
Sindhi http://sd.wikipedia.org
Tamil http://ta.wikipedia.org
Telugu http://te.wikipedia.org
Urdu http://ur.wikipedia.org
If you'd like to use it, I have a tool that downloads random samples of wiki pages and strips the HTML for purposes such as this.
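For anyone who would rather roll their own, here is a rough Python sketch of the sampling idea. It is not the tool mentioned above, just an illustration under a few assumptions: that each wiki serves a random article at /wiki/Special:Random, that pages are UTF-8, and that a "word" is a run of letters, combining marks and digits (plus ZWJ/ZWNJ). The function names, the User-Agent string and the one-second delay are all made up for this example.

```python
import time
import unicodedata
import urllib.request
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Split text into words.

    Combining marks (category M) are kept with their base letters so that
    Indic vowel signs and viramas do not break a word apart.
    """
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "LMN" or ch in "\u200c\u200d":
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def word_frequencies(texts):
    """Count words across an iterable of plain-text strings."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def fetch_random_page(lang):
    """Fetch the HTML of one random page from <lang>.wikipedia.org."""
    url = "https://%s.wikipedia.org/wiki/Special:Random" % lang
    # Hypothetical User-Agent; replace it with something identifying you.
    req = urllib.request.Request(url, headers={"User-Agent": "wordfreq-sample/0.1"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def sample_frequencies(lang, pages=20, delay=1.0):
    """Sample `pages` random pages and return a Counter of word counts."""
    texts = []
    for _ in range(pages):
        texts.append(strip_html(fetch_random_page(lang)))
        time.sleep(delay)  # be gentle with the servers
    return word_frequencies(texts)
```

Something like sample_frequencies("hi", pages=50).most_common(100) would then give a first rough list for Hindi. Note that this counts everything on the page, including navigation and sidebar text, so the interface words will be over-represented unless you restrict it to the article body.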
Good luck!
Andrew Dunbar (hippietrail)
On 14 December 2010 18:36, pravin.d.s@gmail.com wrote:
Hi All,
I am Pravin Satpute. I am working on language technology, and to build word lists with their frequencies I need some web pages in Indic languages.
Can I get the most recent dump without en.wiki?
Thanks,
Pravin s
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l