2010/7/29 Lars Aronsson lars@aronsson.se
My code for extracting the body text from the XML dumps has not been published. But Erik Zachte has published his code for extracting "readable text", and maybe you can use that. See http://stats.wikimedia.org/scripts.zip It's only a lot of regular expressions and substitutions.
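To illustrate what "a lot of regular expressions and substitutions" can look like in practice, here is a minimal Python sketch. It is not Erik Zachte's code and not taken from scripts.zip; the function name `strip_wikitext` and the rule list are hypothetical, and the rules cover only a handful of common wikitext constructs.

```python
import re

# Hypothetical, deliberately incomplete rule set: (pattern, replacement)
# pairs applied in order. Real wikitext has many more constructs.
RULES = [
    (re.compile(r"<!--.*?-->", re.DOTALL), ""),                 # HTML comments
    (re.compile(r"<ref[^>]*?/>", re.IGNORECASE), ""),           # self-closing <ref .../>
    (re.compile(r"<ref[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE), ""),  # <ref>...</ref>
    (re.compile(r"\{\{[^{}]*\}\}"), ""),                        # innermost {{templates}}
    (re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]"), r"\1"),  # [[target|label]] -> label
    (re.compile(r"\[https?://\S+\s+([^\]]*)\]"), r"\1"),        # [url label] -> label
    (re.compile(r"'{2,}"), ""),                                 # ''italic'' / '''bold''' quotes
    (re.compile(r"<[^>]+>"), ""),                               # any remaining tags
]


def strip_wikitext(text):
    """Apply the substitutions repeatedly until the text stops changing.

    Repeating the pass lets the innermost-template rule peel nested
    templates one level at a time.
    """
    previous = None
    while previous != text:
        previous = text
        for pattern, replacement in RULES:
            text = pattern.sub(replacement, text)
    # Collapse runs of spaces and tabs left behind by removed markup.
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    sample = "'''Foo''' is a [[bar|baz]] in [[Qux]].{{citation needed}}<ref>x</ref>"
    print(strip_wikitext(sample))
```

An approach like this is easy to extend rule by rule, but it will mis-handle markup that is not well formed, so the output should be treated as approximate.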
Thanks, Lars, for the details! From the XML dump: that is what I wanted to know (I do the same). HTML is also interesting as a source, since the "absolutely not well-formed wiki syntax" is replaced by "well-formed HTML syntax", but so far I haven't explored it.
Thanks also for the link.
Alex