Re: [Wikisource-l] Goals for Wikisource

29 Jul 2010


2010/7/29 Lars Aronsson <lars@aronsson.se>
...
My code for extracting the body text from the XML dumps
has not been published. But Erik Zachte has published his
code for extracting "readable text", and maybe you can use that.
See http://stats.wikimedia.org/scripts.zip
It's only a lot of regular expressions and substitutions.
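
For illustration only, here is a minimal Python sketch of what a "regular expressions and substitutions" approach to stripping wiki markup might look like. It is not the code from scripts.zip and not Lars's unpublished code; the rule list and the function name strip_wikitext are hypothetical, and real wikitext has many cases (tables, nested links, parser functions) that these few rules do not cover.

```python
import re

# Hypothetical substitution rules, applied in order. These are assumptions
# for illustration, not the rules used by any published extraction script.
RULES = [
    (re.compile(r"<!--.*?-->", re.DOTALL), ""),                      # HTML comments
    (re.compile(r"<ref[^>]*/>", re.IGNORECASE), ""),                 # self-closing refs
    (re.compile(r"<ref[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE), ""),  # paired refs
    (re.compile(r"\{\{[^{}]*\}\}"), ""),                             # innermost templates
    (re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]"), r"\1"),     # [[target|label]] -> label
    (re.compile(r"'{2,}"), ""),                                      # bold/italic quote runs
    (re.compile(r"<[^>]+>"), ""),                                    # any remaining tags
]


def strip_wikitext(text):
    """Apply the rules repeatedly until the text stops changing.

    Repeating the pass lets the template rule peel nested templates
    from the inside out, one level per pass.
    """
    previous = None
    while previous != text:
        previous = text
        for pattern, replacement in RULES:
            text = pattern.sub(replacement, text)
    # Collapse the runs of spaces left behind by removed markup.
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    sample = "''Hello'' [[World|world]] {{tmpl|a={{b}}}}<!-- c --> [[Page]]"
    print(strip_wikitext(sample))
```

The loop matters because a single regex pass cannot match arbitrarily nested templates; removing only the innermost ones and repeating is a common workaround, at the cost of extra passes over the text.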
Thanks, Lars, for the details! Extracting from the XML dump: that is what I wanted
to know (I do the same). HTML is also interesting as a source, since "absolutely not
well-formed wiki syntax" is replaced by "well-formed HTML syntax", but so
far I haven't explored it.
Thanks also for the link.
Alex
