Re: [Wiktionary-l] [Wikitech-l] Listing missing words of wiktionnaries

29 Jul 2013

On 07/26/2013 08:26 PM, Amgine wrote:
...
  Google's n-grams[3] (on the other hand, if someone can figure out how to
  filter n-grams usefully it would mean we don't have to build our own.)
Exactly. And nothing stops us from going both ways,
comparing the results and letting the best frequency list win.
If it were a good idea to settle on the one true list,
linguists would have done so long ago.

Since the 1960s, Gothenburg University has collected word
frequencies for Swedish based on newspaper text. The text
itself is copyrighted, but the frequency lists are made
openly available:
http://spraakbanken.gu.se/pub/statistik/

I'm sure you can find similar resources for many other
languages.

What WMF could do is to compile its own frequency lists
based on Wikipedia and Wikisource, and publish them
at regular intervals (annually?) along with XML dumps.
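
To make the idea concrete, here is a rough sketch of how such a
list could be compiled. It is only an illustration under my own
assumptions: the input is taken to be plain text already extracted
from a dump (wiki markup removed), the tokenizer is a naive regular
expression, and the names frequency_list and missing_words are
hypothetical, not part of any existing tool.

```python
import re
from collections import Counter

# Naive tokenizer (an assumption): runs of Unicode letters, optionally
# joined by an apostrophe or hyphen. Real corpora need language-specific rules.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def frequency_list(lines):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(w.lower() for w in WORD_RE.findall(line))
    return counts.most_common()


def missing_words(freq, known_words):
    """Keep only the entries whose word is not in known_words,
    e.g. the set of page titles a Wiktionary already has."""
    return [(w, n) for w, n in freq if w not in known_words]


if __name__ == "__main__":
    sample = ["Ett hus, två hus.", "Huset är rött."]
    freq = frequency_list(sample)
    for word, count in missing_words(freq, {"hus", "ett"}):
        print(count, word)
```

Lists produced this way from different sources (Wikipedia,
Wikisource, newspaper corpora) could then be compared side by side.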

-- 
   Lars Aronsson (lars(a)aronsson.se)
   Aronsson Datateknik - http://aronsson.se
