On 07/29/2010 07:02 AM, Alex Brollo wrote:
2010/7/28 Lars Aronsson <lars@aronsson.se mailto:lars@aronsson.se>
Wiktionary needs many things: coverage of common words as well as examples of how to use uncommon words. From the Swedish Wikisource, I extracted the body text and made a word frequency list,
This is very interesting. Can you tell us more details? Has the job been documented somewhere (in English, please; Swedish is "a little difficult" for me...)? I can produce lists with my rough script, but it works on raw wiki code and the result is "dirty": it contains markup words, and obviously all the misspelled words too (searching for misspelled words was my first aim...). Did you perhaps work on an HTML dump?
My code for extracting the body text from the XML dumps has not been published. But Erik Zachte has published his code for extracting "readable text", and maybe you can use that. See http://stats.wikimedia.org/scripts.zip It's only a lot of regular expressions and substitutions.
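Once the readable text has been extracted, the frequency list itself is a one-line pipeline. A minimal sketch, assuming the extracted body text has been written to a file named body.txt (a hypothetical filename; neither script above uses that name):

```shell
# body.txt: extracted readable text (hypothetical filename).
# tr: put one token per line; sort | uniq -c: count duplicates;
# sort -rn: most frequent words first.
tr -s '[:space:]' '\n' < body.txt | sort | uniq -c | sort -rn > freq.txt
```

The result in freq.txt is the usual "count word" format, one word per line, sorted by descending frequency.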
After the body text has been extracted, you can either fold case (so Madrid becomes madrid) or not, and either remove punctuation (so e.g. becomes e g) or not, depending on how you want to treat proper names and abbreviations. I use simple "sed" expressions for this. If you don't fold case and don't remove punctuation, you will get a lot of false entries where sentences meet, e.g. both "this." and "this", and both "after" and "After".
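The actual sed expressions were not posted, but the two normalization steps described above can be sketched roughly like this (the \L case-folding escape is a GNU sed extension; tr '[:upper:]' '[:lower:]' is the portable alternative; body.txt and normalized.txt are hypothetical filenames):

```shell
# Fold case: Madrid -> madrid (GNU sed \L lowercases the match)
sed 's/.*/\L&/' < body.txt |
# Replace punctuation with spaces: "e.g." -> "e g", "this." -> "this"
sed 's/[[:punct:]]/ /g' > normalized.txt
```

Feeding normalized.txt into the counting pipeline then merges "this." with "this" and "After" with "after" instead of listing them as separate entries.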