On 07/29/2010 07:02 AM, Alex Brollo wrote:
2010/7/28 Lars Aronsson <lars@aronsson.se mailto:lars@aronsson.se>
Wiktionary needs many things: coverage of common words as well as examples of how to use uncommon words. From the Swedish Wikisource, I extracted the body text and made a word frequency list,
This is very interesting. Can you tell us more details? Has the job been documented somewhere (in English, please; Swedish is "a little difficult" for me...)? I can produce lists with my rough script, but it works on raw wiki code and the result is "dirty": it contains markup words, and obviously all the misspelled words too (searching for misspelled words was my first aim...). Did you perhaps work on an HTML dump?
My code for extracting the body text from the XML dumps has not been published. But Erik Zachte has published his code for extracting "readable text", and maybe you can use that. See http://stats.wikimedia.org/scripts.zip It's only a lot of regular expressions and substitutions.
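Once the readable text has been extracted, the frequency list itself is a one-line pipeline. A minimal sketch, assuming the extracted body text has been written to a file named body.txt (a hypothetical filename; neither script above uses that name):

```shell
# body.txt: extracted readable text (hypothetical filename).
# tr: put one token per line; sort | uniq -c: count duplicates;
# sort -rn: most frequent words first.
tr -s '[:space:]' '\n' < body.txt | sort | uniq -c | sort -rn > freq.txt
```

The result in freq.txt is the usual "count word" format, one word per line, sorted by descending frequency.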
After the body text has been extracted, you can either fold case (so Madrid becomes madrid) or not, and either remove punctuation (so e.g. becomes e g) or not, depending on how you want to treat proper names and abbreviations. I use simple "sed" expressions for this. If you don't fold case and don't remove punctuation, you will get a lot of false entries where sentences meet, e.g. both "this." and "this", and both "after" and "After".
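The actual sed expressions were not posted, but the two normalization steps described above can be sketched roughly like this (the \L case-folding escape is a GNU sed extension; tr '[:upper:]' '[:lower:]' is the portable alternative; body.txt and normalized.txt are hypothetical filenames):

```shell
# Fold case: Madrid -> madrid (GNU sed \L lowercases the match)
sed 's/.*/\L&/' < body.txt |
# Replace punctuation with spaces: "e.g." -> "e g", "this." -> "this"
sed 's/[[:punct:]]/ /g' > normalized.txt
```

Feeding normalized.txt into the counting pipeline then merges "this." with "this" and "After" with "after" instead of listing them as separate entries.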