The request is to create a web-based text corpus from which to derive
frequencies and then compare them with existing wiktionaries. Not a light
undertaking, but one which has been proposed and implemented previously
(e.g. Connel's Gutenberg project).
Generally speaking, someone would need to determine the appropriate
size of the corpus sample, its temporal currency, and the method of
creating and maintaining it. This isn't easy to do, and having no
strictures results in unwieldy and mostly irrelevant products like
Google's n-grams. (On the other hand, if someone can figure out how to
filter n-grams usefully, it would mean we don't have to build our own.)
On 07/23/2013 11:23 AM, Mathieu Stumpf wrote:
Here is what I would like to do: generate reports which give, for
a given language, a list of words that are used on the web, with a
number estimating their occurrences, but which are not in a given
wiktionary. How would you recommend implementing that within Wikimedia?
Some years back, I undertook to add entries for
Swedish words in the English Wiktionary. You can
follow my diary at http://en.wiktionary.org/wiki/User:LA2
Among the things I did was to extract a list of all
Swedish words that already had entries. The best
way was to use CatScan to list entries in categories
for Swedish words. Even if there is a page called
"men", this doesn't mean the Swedish word "men"
has an entry, because it could be the English word
"men" that is in that page.
Then I extracted all words from some known texts,
e.g. novels, the Bible, government reports, and the
Swedish Wikipedia, counting the number of
occurrences of each word. Case handling is
a bit tricky. There should not be an entry for
lower-case stockholm, so you can't just convert
everything to lower case. But if a sentence begins
with a capital letter, that word should not have
a capitalized entry. Another tricky issue is
abbreviations, which should keep the period,
for example "i.e." rather than "i" and "e". But
the period that ends a sentence should be removed.
When splitting a text into words, I decided to keep
all periods and initial capital letters, even if this
leads to some false words.
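In Python, that compromise amounts to something like the following.
This is only a sketch, and the regex is illustrative; it keeps periods
and original capitalization, so it will produce the false words I
mentioned:

import re
from collections import Counter

def tokenize(text):
    # Keep capitalization and keep periods, so "t.ex." survives as one
    # token and a sentence-final "stad." is kept as-is (a false word).
    return re.findall(r"[^\W\d_]+(?:\.[^\W\d_]*)*", text)

def word_frequencies(text):
    # Count the number of occurrences of each token.
    return Counter(tokenize(text))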
When you have word frequency statistics for a text,
and a list of existing entries from Wiktionary, you
can compute the coverage, and I wrote a little
script for this. I found that English Wiktionary already
had Swedish entries covering 72% of the words in the
Bible, and when I started to add entries for the most
common of the missing words, I was able to increase
this to 87% in just a single month (September 2010).
Many of the common words that were missing when
I started were adverbs such as "thereof", "herein",
which occur frequently in any text but are not very
exciting to write entries about. This statistics-based
approach gave me a reason to add those entries.
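The script itself was nothing fancy; the idea is roughly this. It
reuses the word_frequencies and swedish_entries names from the sketches
above, and the file name is made up, so treat it as an illustration
rather than the script I actually ran:

from collections import Counter

def coverage_report(frequencies, entries, top_missing=50):
    # Token coverage: the share of running words whose exact form has
    # an entry, plus the most frequent forms that are still missing.
    total = sum(frequencies.values())
    covered = sum(n for word, n in frequencies.items() if word in entries)
    missing = Counter({w: n for w, n in frequencies.items() if w not in entries})
    return covered / total, missing.most_common(top_missing)

# freqs = word_frequencies(open("bibeln.txt", encoding="utf-8").read())
# ratio, worst = coverage_report(freqs, swedish_entries)
# print(f"coverage: {ratio:.0%}")
# for word, count in worst:
#     print(count, word)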
It is interesting to contrast a given text to a given
dictionary in this way. The Swedish entries in the
English Wiktionary form a different dictionary than the
Swedish entries in the German or Danish Wiktionary.
The kinds of words found in the Bible are different
from those found in Wikipedia or in legal texts.
There is not a single, universal text corpus that we
can aim to cover. Google has released its n-gram
dataset. I'm not sure if it covers Swedish, but even
if it does, it must differ from the corpus frequencies
published by the Swedish Academy.
It is relatively easy to extract a list of existing entries
from Wiktionary. But preparing a given text corpus
for frequency and coverage analysis takes more work.