On 30/07/13 08:15, Mathieu Stumpf wrote:
Actually, I think it would be interesting to have a trend history of word usage over centuries (a current trend would also be interesting, but probably harder to implement). Wikisource could be used to achieve that.
Not really. Or, more fairly: the available texts are probably not a valid sample, though they could serve as an informal guideline.
"Full documentation: The sobering examples of the research experiences of Timberlake and Ruppenhofer (mentioned above) show that even 100,000,000 words is at least an order of magnitude too small to capture phenomena that, though of low frequency, are in the competence of ordinary native speakers. That would represent at least 20,000 recorded hours, and it is too low by an order of magnitude."[1]
Of course this references spoken language which, in most cases, differs significantly from written language, but a running-word corpus of 100,000,000 seems a useful target, with samples weighted among transcripts, periodicals, and texts from a delimited time and region, plus a lemmatized lexicon of 6,000-10,000 entries.
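For what it's worth, the trend idea above is simple to prototype. A minimal sketch, assuming dated plain-text samples (the texts, years, and the per-million normalization below are purely illustrative, not a real Wikisource extraction):

```python
from collections import defaultdict
import re

# Hypothetical (year, text) samples, as might be drawn from Wikisource.
samples = [
    (1605, "the quality of mercy is not strained it droppeth as the gentle rain"),
    (1719, "I was born in the year 1632 in the city of York of a good family"),
    (1851, "call me Ishmael some years ago never mind how long precisely"),
    (1925, "in my younger and more vulnerable years my father gave me some advice"),
]

def century(year):
    # 1605 -> 17 (seventeenth century)
    return (year - 1) // 100 + 1

def trend(word, samples):
    """Relative frequency of `word` per century, in hits per million running words."""
    totals = defaultdict(int)  # running words per century
    hits = defaultdict(int)    # occurrences of `word` per century
    for year, text in samples:
        tokens = re.findall(r"[a-z']+", text.lower())
        c = century(year)
        totals[c] += len(tokens)
        hits[c] += tokens.count(word.lower())
    return {c: 1_000_000 * hits[c] / totals[c] for c in sorted(totals)}

print(trend("years", samples))
```

The normalization per million running words matters because sample sizes per century will vary wildly; raw counts alone would mostly reflect how much text survives from each period, which is exactly the sampling-validity problem mentioned above.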
Amgine
[1] http://emeld.org/school/classroom/text/lexicon-size.html