On 14 December 2010 14:28, Andrew Dunbar <hippytrail(a)gmail.com> wrote:
The dump site (http://download.wikimedia.org/) is still broken at the
moment, but another way to build some word frequency data is by
randomly sampling the wikis for the languages you are interested in.
At least these Indic languages have Wikipedias of varying sizes:
Assamese: http://as.wikipedia.org
Bihari: http://bh.wikipedia.org
Bengali: http://bn.wikipedia.org
Bishnupriya Manipuri: http://bpy.wikipedia.org
Gujarati: http://gu.wikipedia.org
Hindi: http://hi.wikipedia.org
Kannada: http://kn.wikipedia.org
Kashmiri: http://ks.wikipedia.org
Marathi: http://mr.wikipedia.org
Nepali: http://ne.wikipedia.org
Nepal Bhasa: http://new.wikipedia.org
Oriya: http://or.wikipedia.org/wiki
Eastern Punjabi: http://pa.wikipedia.org
Western Punjabi: http://pnb.wikipedia.org
Sanskrit: http://sa.wikipedia.org
Sindhi: http://sd.wikipedia.org
Tamil: http://ta.wikipedia.org
Telugu: http://te.wikipedia.org
Urdu: http://ur.wikipedia.org
If you'd like to use it I have a tool that downloads random samples of
wiki pages and strips the HTML for purposes such as this.
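
For anyone who wants to try the sampling idea in the meantime, here is a
minimal Python sketch. It is not the tool mentioned above, only an
illustration under some assumptions: that each wiki serves a random
article at /wiki/Special:Random, that pages are UTF-8, and that counting
every visible word on the page (navigation text included) is acceptable
for a first pass. All function names and the User-Agent string are made
up for this example.

```python
import time
import unicodedata
import urllib.request
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Zero-width (non-)joiners occur inside words in several Indic scripts.
JOINERS = {"\u200c", "\u200d"}


def tokenize(text):
    """Split text into words made of letters, combining marks and digits.

    Marks (Unicode categories M*) are kept on purpose: a plain regex
    \\w+ would split Indic words at vowel signs and viramas.
    """
    words, current = [], []
    for ch in text:
        if ch in JOINERS or unicodedata.category(ch)[0] in "LMN":
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def fetch_random_page(lang):
    """Download one random page from the given language's Wikipedia."""
    url = f"https://{lang}.wikipedia.org/wiki/Special:Random"
    request = urllib.request.Request(
        url, headers={"User-Agent": "word-freq-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def sample_word_frequencies(lang, pages=10, delay=1.0):
    """Count words over a number of random pages, pausing between requests."""
    counts = Counter()
    for _ in range(pages):
        counts.update(tokenize(strip_html(fetch_random_page(lang))))
        time.sleep(delay)
    return counts
```

Calling sample_word_frequencies("hi", pages=50).most_common(100) would
then give a rough list of frequent Hindi words. Because the sketch keeps
interface text such as menus and footers, those words will be
over-represented; restricting extraction to the article body would be
the obvious next refinement.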
Yeah, let me know; that will be very useful.
Thanks,
Pravin s