Re: [Wikipedia-l] convert wiki markup to plain text

31 Mar 2010


      Francis Tyers wrote:
...
Actually it is surprisingly difficult. I have a script which does it
here:
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-lex-lea...
Which really needs to be redone for each Wikipedia. If you ask
http://en.wikipedia.org/wiki/User:Tresoldi#Wikipedia_as_a_corpus
He has some scripts which do it too. But there is no generic "nice" way
of getting Wikipedia as a nice plain text corpus so far. If anyone has
one I would love to hear about it.
Convert to html using mediawiki, then filter out all html tags.
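
A minimal sketch of the second half of that suggestion (filtering out the
HTML tags), assuming the page has already been rendered to HTML by MediaWiki
and is available as a string. The names TextExtractor and html_to_text are
made up for illustration and are not taken from any of the scripts mentioned
above. It uses only Python's standard html.parser module and makes no attempt
to drop navigation boxes, reference lists or other page furniture, which
would still need per-wiki handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join("".join(parser.parts).split())
```

For example, html_to_text("<p>Hello <b>world</b></p>") gives "Hello world".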
