Hello,
I downloaded a copy of the Wikipedia database, which is an XML file once decompressed. As you know, the XML file contains a lot of wiki markup, and I want to convert that markup to plain text. Is there any tool I can use?
Thanks!
On Tue, 16 Mar 2010 at 19:36 +0800, OrzzrO wrote:
Hello,
I downloaded a copy of the Wikipedia database, which is an XML file once decompressed. As you know, the XML file contains a lot of wiki markup, and I want to convert that markup to plain text. Is there any tool I can use?
Thanks!
Actually, it is surprisingly difficult. I have a script which does it here:
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-lex-lea...
but it really needs to be redone for each Wikipedia. If you ask
http://en.wikipedia.org/wiki/User:Tresoldi#Wikipedia_as_a_corpus
he has some scripts which do it too. But so far there is no generic "nice" way of getting Wikipedia as a plain text corpus. If anyone has one, I would love to hear about it.
Francis Tyers wrote:
Actually, it is surprisingly difficult. I have a script which does it here:
https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-lex-lea...
but it really needs to be redone for each Wikipedia. If you ask
http://en.wikipedia.org/wiki/User:Tresoldi#Wikipedia_as_a_corpus
he has some scripts which do it too. But so far there is no generic "nice" way of getting Wikipedia as a plain text corpus. If anyone has one, I would love to hear about it.
Convert to HTML using MediaWiki, then filter out all the HTML tags.
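The second step (filtering out the tags) could look roughly like the sketch below. It assumes the article has already been rendered to HTML by MediaWiki, which is not shown here, and it uses only Python's standard-library html.parser. The names TextExtractor and html_to_text are made up for illustration, and skipping script/style contents is my own choice rather than anything MediaWiki requires.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, dropping tags and the contents of <script>/<style>."""

    SKIPPED = {"script", "style"}  # assumption: these never hold article text

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    sample = '<p>The <b>quick</b> &amp; <a href="/wiki/Fox">brown fox</a>.</p>'
    print(html_to_text(sample))
```

This only removes markup. Rendered pages also carry navigation boxes, reference lists and table contents, so real corpus work would probably still need per-wiki filtering on top of it.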
wikipedia-l@lists.wikimedia.org