Francis Tyers wrote:
Actually it is surprisingly difficult. I have a script which does it here:
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-lex-lea...
It really needs to be redone for each Wikipedia. You could also ask Tresoldi:
http://en.wikipedia.org/wiki/User:Tresoldi#Wikipedia_as_a_corpus
He has some scripts which do it too. But so far there is no generic "nice" way of getting Wikipedia as a plain text corpus. If anyone has one I would love to hear about it.
Convert to HTML using MediaWiki, then filter out all HTML tags.
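
A rough sketch of the tag-filtering half, assuming the pages have already been rendered to HTML by MediaWiki (that step is not shown). It uses only Python's standard html.parser module; the names TextExtractor and strip_html are made up for illustration, and the list of block-level tags is a guess at what keeps words from running together, not a complete treatment of Wikipedia markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and drops tags (hypothetical helper)."""

    # Contents of these elements are discarded entirely.
    SKIP = {"script", "style"}
    # Assumed set of tags that should act as word boundaries.
    BLOCK = {"p", "div", "br", "li", "tr", "td", "th",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_html(html):
    """Return the plain text of an HTML string, without tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b> &amp; all</p><script>var x = 1;</script>"
    print(strip_html(sample))
```

This still leaves navigation boxes, reference lists and similar page furniture in the output, so some per-Wikipedia cleanup is probably unavoidable.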