On 07/30/2013 07:17 PM, Amgine wrote:
Of course, this refers to spoken language, which in most cases differs significantly from written language. Still, a running-word corpus of 100,000,000 seems a useful target, with samples weighted between transcripts, periodicals, and texts from a delimited time and region. Lemmatized corpus of 6,000-10,000.
If you want to compare one year or decade to the next, you need a similar sample from each period. One way to get this is to narrow the corpus down to a single journal or newspaper. Wikisource can do this with Popular Science Monthly, https://en.wikisource.org/wiki/PSM
You'll get popular science, and only that, for every year. You won't have romantic poetry one year and theological texts the next. You can spot trends in the use of words like engine/motor or steam/electricity precisely because that is what this journal is about, and you get the same number of issues and pages each year.
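To make the year-to-year comparison concrete, here is a rough sketch in Python. It assumes you already have the plain text of each volume keyed by year; the function name relative_frequencies and the two toy strings are made up for illustration and are not real PSM text. Counts are normalized per million running words so that years with slightly different amounts of text stay comparable.

```python
import re
from collections import Counter


def relative_frequencies(texts_by_year, words):
    """Return {year: {word: occurrences per million running words}}.

    texts_by_year is a hypothetical mapping of year -> plain text.
    """
    result = {}
    for year, text in sorted(texts_by_year.items()):
        # Crude tokenizer: lowercase runs of ASCII letters only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        result[year] = {w: counts[w] * 1_000_000 / total for w in words}
    return result


# Toy input, invented for this example (not taken from PSM).
sample = {
    1875: "the steam engine drives the mill and the steam pump",
    1905: "the electric motor replaces the steam engine in the mill",
}

freqs = relative_frequencies(sample, ["steam", "electric", "engine", "motor"])
for year, row in freqs.items():
    print(year, {w: round(v) for w, v in row.items()})
```

This counts surface forms only; a lemmatized count would first have to map "engines" to "engine" and so on, which is left out here.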
Some assembly required: Most volumes of PSM are not complete yet. Lots of proofreading remains.