Re: [Wikitech-l] require language dump for developing words and corresponding frequency

14 Dec 2010

On 14 December 2010 14:28, Andrew Dunbar <hippytrail(a)gmail.com> wrote:

...
 The dump site (http://download.wikimedia.org/) is still broken at the
 moment but another way to build some word frequency data is by
 randomly sampling the wikis for the languages you are interested in.
 At least these Indic languages have Wikipedias of varying sizes:

 Assamese http://as.wikipedia.org
 Bihari http://bh.wikipedia.org
 Bengali http://bn.wikipedia.org
 Bishnupriya Manipuri http://bpy.wikipedia.org
 Gujarati http://gu.wikipedia.org
 Hindi http://hi.wikipedia.org
 Kannada http://kn.wikipedia.org
 Kashmiri http://ks.wikipedia.org
 Marathi http://mr.wikipedia.org
 Nepali http://ne.wikipedia.org
 Nepal Bhasa http://new.wikipedia.org
 Oriya http://or.wikipedia.org/wiki
 Eastern Punjabi http://pa.wikipedia.org
 Western Punjabi http://pnb.wikipedia.org
 Sanskrit http://sa.wikipedia.org
 Sindhi http://sd.wikipedia.org
 Tamil http://ta.wikipedia.org
 Telugu http://te.wikipedia.org
 Urdu http://ur.wikipedia.org

 If you'd like to use it I have a tool that downloads random samples of
 wiki pages and strips the HTML for purposes such as this.

Yeah, let me know, that will be very useful.

Thanks,
Pravin s
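
For reference, the approach described in the quoted message (take sampled wiki pages, strip the HTML, count the words) could be sketched roughly as below. This is not Andrew's tool, only a minimal illustration using the Python standard library. The names strip_html, tokenize and word_frequencies are made up for this example, and the page HTML is assumed to have been downloaded already by some other means. One point that matters for Indic scripts: vowel signs and the virama are combining marks, so the tokenizer keeps Unicode mark characters inside a word instead of splitting on them.

```python
import unicodedata
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Split text into words made of letters, combining marks and digits.

    Zero-width joiner/non-joiner are kept too, since several Indic
    orthographies use them inside words.
    """
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "LMN" or ch in "\u200c\u200d":
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def word_frequencies(texts):
    """Count words across an iterable of plain-text strings."""
    counts = Counter()
    for text in texts:
        # NFC so that equivalent character sequences are counted together.
        counts.update(tokenize(unicodedata.normalize("NFC", text)))
    return counts
```

With a list of downloaded pages in a variable such as pages_html (assumed here), word_frequencies(strip_html(p) for p in pages_html).most_common(50) would give the fifty most frequent words in the sample. How representative that is depends on how many pages are sampled and on the size of the wiki.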
