Hello,
I'm planning on using the text of Wikipedia for research purposes but I would like to remove the Wikicode, leaving only plain text. Does anyone know of scripts or programs that do this automatically?
Thanks in advance!
Graham Neubig wrote:
> I'm planning on using the text of Wikipedia for research purposes but I would like to remove the Wikicode, leaving only plain text. Does anyone know of scripts or programs that do this automatically?
Graham, you could perhaps use http://www.yourserver.com/yourwiki/index.php?action=raw&title=Pagetitle to get a raw dump of that page.
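For anyone scripting this, the raw-fetch URL can be assembled with the standard library; a minimal sketch (the server path is the same placeholder as above, and the page title is a hypothetical example):

```python
from urllib.parse import urlencode

def raw_url(base, title):
    """Build an index.php?action=raw URL for a page title.

    action=raw makes MediaWiki return the page's wikitext instead
    of rendered HTML; urlencode handles spaces and special characters
    in the title.
    """
    return base + "?" + urlencode({"action": "raw", "title": title})

url = raw_url("http://www.yourserver.com/yourwiki/index.php", "Main Page")
print(url)
# -> http://www.yourserver.com/yourwiki/index.php?action=raw&title=Main+Page
```

The result can then be fetched with any HTTP client to retrieve the raw wikitext.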
I don't think this is what I'm looking for; in fact, I want exactly the opposite. I already have the Wikicode in my database, and I want to get rid of the Wikicode tags. The raw function returns the code in its original format (Wiki tags included) without turning it into HTML, right?
On 4/20/05, Thomas Gries mail@tgries.de wrote:
> Graham Neubig wrote:
> > I'm planning on using the text of Wikipedia for research purposes but I would like to remove the Wikicode, leaving only plain text. Does anyone know of scripts or programs that do this automatically?
> Graham, you could perhaps use http://www.yourserver.com/yourwiki/index.php?action=raw&title=Pagetitle to get a raw dump of that page.
Graham-
> I'm planning on using the text of Wikipedia for research purposes but I would like to remove the Wikicode, leaving only plain text. Does anyone know of scripts or programs that do this automatically?
I advise against using this approach, since you effectively have to parse the wikitext anyway. This especially applies to templates -- the syntax {{Abc}} dynamically includes the content of the page Template:Abc wherever it is used. Templates are used extensively on Wikipedia, including some with parametrized name/value substitution, and if you use the wikitext as a starting point, you will have to load and process them yourself.
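To illustrate the problem: a naive wikitext cleaner that does not implement transclusion can only delete {{...}} constructs, not expand them, so any text supplied by a template is silently lost. A rough Python sketch of such a stripper (this is an illustration of the limitation, not a recommended tool):

```python
import re

def strip_templates(wikitext):
    """Repeatedly remove innermost {{...}} pairs.

    Handles nesting by iterating until no match remains, but makes
    no attempt to fetch or expand the templates -- their content
    simply disappears from the output.
    """
    pattern = re.compile(r"\{\{[^{}]*\}\}")
    while True:
        wikitext, n = pattern.subn("", wikitext)
        if n == 0:
            return wikitext

sample = "Intro {{Infobox|name={{PAGENAME}}}} text."
print(strip_templates(sample))  # -> "Intro  text."
```

Note that whatever the infobox would have rendered is gone, which is exactly why starting from the wikitext is a poor basis for extracting readable plain text.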
I recommend generating a static HTML dump instead and converting it to plaintext, for which there are a number of tools (notably lynx -dump). There is a basic static HTML dumper in the current CVS version of MediaWiki: maintenance/dumpHTML.php - see Tim Starling's mailing list post on it: http://mail.wikipedia.org/pipermail/wikitech-l/2005-April/028741.html
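If lynx is unavailable, a rough standard-library Python sketch of the HTML-to-plaintext step might look like the following. It is a heavy simplification of what `lynx -dump` does (no link list, no layout), keeping only the text nodes and skipping script/style content:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes from HTML, ignoring script/style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self.parts).split())

p = TextExtractor()
p.feed("<p>Hello <b>world</b></p><script>var x;</script>")
print(p.text())  # -> "Hello world"
```

Feeding it the pages produced by the static HTML dump would yield plain text suitable for research corpora.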
All best,
Erik
Erik Moeller (erik_moeller@gmx.de) [050423 11:57]:
> There is a basic static HTML dumper in the current CVS version of MediaWiki: maintenance/dumpHTML.php - see Tim Starling's mailing list post on it: http://mail.wikipedia.org/pipermail/wikitech-l/2005-April/028741.html
What issues would be involved in backporting this to 1.4.x? Tim?
(Playing with Wikimedia DB dumps at present.)
- d.
David Gerard wrote:
> Erik Moeller (erik_moeller@gmx.de) [050423 11:57]:
> > There is a basic static HTML dumper in the current CVS version of MediaWiki: maintenance/dumpHTML.php - see Tim Starling's mailing list post on it: http://mail.wikipedia.org/pipermail/wikitech-l/2005-April/028741.html
> What issues would be involved in backporting this to 1.4.x? Tim?
> (Playing with Wikimedia DB dumps at present.)
Backporting is trivial compared to fixing the other issues I listed.
-- Tim Starling
wikitech-l@lists.wikimedia.org