On 07/26/2013 08:26 PM, Amgine wrote:
> Google's n-grams[3] (on the other hand, if someone can figure out how to
> filter n-grams usefully it would mean we don't have to build our own.)
Exactly. And nothing stops us from going both ways,
comparing the results, and letting the best frequency list win.
If it were a good idea to settle on the one true list,
linguists would have done so long ago.
Since the 1960s, Gothenburg University has collected word
frequencies for Swedish based on newspaper text.
The text is copyrighted, but the frequency lists
are made openly available at
http://spraakbanken.gu.se/pub/statistik/
I'm sure you can find similar resources for many other
languages.
What WMF could do is to compile its own frequency lists
based on Wikipedia and Wikisource, and publish them
at regular intervals (annually?) along with XML dumps.
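As a rough illustration of what compiling such a list could involve, here is a minimal Python sketch. It is not an existing WMF tool. The function names, the letters-only tokenization, and the tab-separated output format are all assumptions made for the example, and it takes plain text that has already been extracted from a dump.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of Unicode letters. Real
# tokenization rules differ per language and would need more care.
WORD_RE = re.compile(r"[^\W\d_]+")


def word_frequencies(texts):
    """Count lowercased words across an iterable of plain-text strings."""
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts


def format_frequency_list(counts, limit=None):
    """Render counts as 'count<TAB>word' lines, most frequent first.

    Ties are broken alphabetically so the output is reproducible.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if limit is not None:
        ranked = ranked[:limit]
    return "\n".join(f"{count}\t{word}" for word, count in ranked)


if __name__ == "__main__":
    # Hypothetical sample input standing in for extracted article text.
    sample = ["The cat and the hat", "the CAT"]
    print(format_frequency_list(word_frequencies(sample), limit=2))
```

Since only the counts are published, the same approach would also work for sources where the text itself cannot be redistributed.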
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se