On 07/29/2010 07:02 AM, Alex Brollo wrote:
2010/7/28 Lars Aronsson <lars(a)aronsson.se>
Wiktionary may need many things: coverage of common
words as well as examples of how to use uncommon words.
From the Swedish Wikisource, I extracted the body text and
made a word frequency list.
This is very interesting. Can you tell us more details about it? Has
the job been documented somewhere (in English; Swedish is "a little
difficult" for me...)? I can produce lists with my rough script, but it
works on raw wiki code and the result is "dirty": it contains markup
words, and obviously all the wrong words too (searching for wrong words
was my first aim...). Did you work on an HTML dump, perhaps?
My code for extracting the body text from the XML dumps
has not been published. But Erik Zachte has published his
code for extracting "readable text", and maybe you can use that.
See
http://stats.wikimedia.org/scripts.zip
It's only a lot of regular expressions and substitutions.
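To give a flavour of what such substitutions look like, here is a small sketch of my own (these patterns are illustrative guesses at common wikitext markup, not taken from scripts.zip):

```shell
#!/bin/sh
# Hypothetical sketch (my own patterns, not Erik Zachte's actual code):
# strip some common wikitext markup from stdin with sed.
strip_wikitext() {
  sed -e "s/'''//g" \
      -e "s/''//g" \
      -e 's/\[\[\([^]|]*\)|\([^]]*\)\]\]/\2/g' \
      -e 's/\[\[\([^]]*\)\]\]/\1/g' \
      -e 's/{{[^}]*}}//g' \
      -e 's/<[^>]*>//g'
}

echo "The '''[[Madrid|capital]]''' of [[Spain]]." | strip_wikitext
# -> The capital of Spain.
```

A real script needs many more rules (nested templates, tables, references), which is why Erik's version runs to a whole collection of expressions.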
After the body text has been extracted, you can either fold
case (so Madrid becomes madrid) or not, and you can either
remove punctuation (so e.g. becomes e g) or not,
depending on how you want to treat proper names and
abbreviations. I use simple "sed" expressions for this.
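Those two normalization steps might look like this (my own sketch of the kind of expressions Lars describes, not his actual ones):

```shell
# Hypothetical sketch of the two normalization steps described above.

# Fold case: Madrid -> madrid
echo "Madrid" | sed 's/.*/\L&/'             # GNU sed; \L lowercases
echo "Madrid" | tr '[:upper:]' '[:lower:]'  # portable alternative -> madrid

# Remove punctuation: each punctuation character becomes a space,
# so "e.g." turns into "e g"
echo "e.g." | sed 's/[[:punct:]]/ /g'
```

Whether you apply one, both, or neither decides what counts as "the same word" in the final frequency list.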
If you don't fold case and don't remove punctuation,
you will get a lot of false entries where sentences meet,
e.g. both "this." and "this", and both "after" and
"After".
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se