On 07/26/2013 08:26 PM, Amgine wrote:
> Google's n-grams[3] (on the other hand, if someone can figure out how to filter n-grams usefully it would mean we don't have to build our own.)
Exactly. And nothing stops us from going both ways, comparing the results, and letting the best frequency list win. If it were a good idea to settle on the one true list, linguists would have done so long ago.
Since the 1960s, Gothenburg University has been collecting word frequencies for Swedish based on newspaper text. The text is copyrighted, but the frequency lists are made openly available: http://spraakbanken.gu.se/pub/statistik/
I'm sure you can find similar resources for many other languages.
What WMF could do is compile its own frequency lists based on Wikipedia and Wikisource, and publish them at regular intervals (annually?) along with the XML dumps.
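
To make that concrete, here is a minimal sketch of the counting step. It assumes plain text has already been extracted from the dumps (stripping wikitext markup is not shown), and the function names, the tokenization rule, and the tab-separated output are my own assumptions for illustration, not an existing WMF tool or format.

```python
import re
from collections import Counter

# Assumed tokenization rule: a word is a run of Unicode letters.
# Real lists would need per-language decisions (hyphens, apostrophes, etc.).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def word_frequencies(texts):
    """Count lowercased word occurrences across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


def format_frequency_list(counts, limit=None):
    """Render counts as 'word<TAB>count' lines, most frequent first.

    Ties are broken alphabetically so the output is reproducible.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if limit is not None:
        ranked = ranked[:limit]
    return "\n".join(f"{word}\t{count}" for word, count in ranked)


# Tiny stand-in for text extracted from a dump (hypothetical sample).
sample = ["Det var en gång en katt.", "En katt och en hund."]
counts = word_frequencies(sample)
print(format_frequency_list(counts, limit=3))
```

Running the same script over each dump would give lists that can be compared from one year to the next, or against lists from other sources.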