On 30/07/13 08:15, Mathieu Stumpf wrote:
Actually, I think it would be interesting to have a trend history of word usage over centuries (a current trend would also be interesting, but probably harder to implement). Wikisource could be used to achieve that.
Not really. Or, more fairly: the available texts are probably not a valid sample, though they could serve as an informal guideline.
"Full documentation: The sobering examples of the research experiences of Timberlake and Ruppenhofer (mentioned above) show that even 100,000,000 words is at least an order of magnitude too small to capture phenomena that, though of low frequency, are in the competence of ordinary native speakers. That would represent at least 20,000 recorded hours, and it is too low by an order of magnitude."[1]
Of course this references spoken language which, in most cases, differs significantly from written language, but a running-word corpus of 100,000,000 seems a useful target, with samples weighted among transcripts, periodicals, and texts from a delimited time and region, plus a lemmatized lexicon of 6,000-10,000 entries.
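For what it's worth, the trend idea above is simple to prototype. A minimal sketch, assuming dated plain-text samples (the texts, years, and the per-million normalization below are purely illustrative, not a real Wikisource extraction):

```python
from collections import defaultdict
import re

# Hypothetical (year, text) samples, as might be drawn from Wikisource.
samples = [
    (1605, "the quality of mercy is not strained it droppeth as the gentle rain"),
    (1719, "I was born in the year 1632 in the city of York of a good family"),
    (1851, "call me Ishmael some years ago never mind how long precisely"),
    (1925, "in my younger and more vulnerable years my father gave me some advice"),
]

def century(year):
    # 1605 -> 17 (seventeenth century)
    return (year - 1) // 100 + 1

def trend(word, samples):
    """Relative frequency of `word` per century, in hits per million running words."""
    totals = defaultdict(int)  # running words per century
    hits = defaultdict(int)    # occurrences of `word` per century
    for year, text in samples:
        tokens = re.findall(r"[a-z']+", text.lower())
        c = century(year)
        totals[c] += len(tokens)
        hits[c] += tokens.count(word.lower())
    return {c: 1_000_000 * hits[c] / totals[c] for c in sorted(totals)}

print(trend("years", samples))
```

The normalization per million running words matters because sample sizes per century will vary wildly; raw counts alone would mostly reflect how much text survives from each period, which is exactly the sampling-validity problem mentioned above.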
Amgine
[1] http://emeld.org/school/classroom/text/lexicon-size.html