2010/7/29 Lars Aronsson
<lars@aronsson.se>
My code for extracting the body text from the XML dumps
has not been published. But Erik Zachte has published his
code for extracting "readable text", and maybe you can use that.
See http://stats.wikimedia.org/scripts.zip
It's only a lot of regular expressions and substitutions.
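
For readers who have not seen this style of extraction, a minimal Python sketch of the "regular expressions and substitutions" approach follows. It is not taken from Erik Zachte's scripts: the rule list, the name RULES, and the function strip_wikitext are hypothetical and cover only a few common wiki constructs. Real dump text has many more cases (tables, images, a "<" used as a plain character) that this will not handle correctly.

```python
import re

# Hypothetical, simplified rules for illustration only.
# Order matters: self-closing <ref/> must be removed before paired <ref>...</ref>.
RULES = [
    (re.compile(r"<!--.*?-->", re.DOTALL), ""),                  # HTML comments
    (re.compile(r"<ref[^>]*/>"), ""),                            # <ref name="x"/>
    (re.compile(r"<ref[^>]*>.*?</ref>", re.DOTALL), ""),         # footnotes
    (re.compile(r"\{\{[^{}]*\}\}"), ""),                         # innermost templates
    (re.compile(r"\[\[(?:[^|\[\]]*\|)?([^\[\]]*)\]\]"), r"\1"),  # [[target|label]] -> label
    (re.compile(r"\[https?://\S+ ([^\]]*)\]"), r"\1"),           # [url label] -> label
    (re.compile(r"'{2,}"), ""),                                  # bold/italic markers
    (re.compile(r"<[^>]+>"), ""),                                # any remaining tags
]


def strip_wikitext(text: str) -> str:
    """Apply the rules repeatedly until the text stops changing.

    Repeating the pass is what removes nested templates: each pass
    strips the innermost {{...}} and exposes the next level.
    """
    previous = None
    while previous != text:
        previous = text
        for pattern, replacement in RULES:
            text = pattern.sub(replacement, text)
    # Collapse runs of spaces and tabs left behind by the removals.
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    sample = (
        "'''Stockholm''' is the capital of [[Sweden]].{{citation needed}} "
        "See [[Lake Mälaren|the lake]]<ref>Some source</ref>."
    )
    print(strip_wikitext(sample))
```

The input here would be the content of one <text> element from the XML dump. Because wiki syntax is not well formed, rules like these are approximations and usually need tuning against real pages.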
Thanks, Lars, for the details! Extracting from the XML dump: that is what I wanted to know (it is the same thing I do). HTML is also interesting as a source, since the "absolutely not well-formed wiki syntax" is replaced by "well-formed HTML syntax", but so far I haven't explored it.
Thanks also for the link.
Alex