Paul Houle wrote:
I did a substantial project that worked from the XML dumps. I
designed a recursive descent parser in C# that, with a few tricks, decodes Wikipedia markup almost correctly. Getting it completely right is tricky for a number of reasons, but my approach preserved some semantics that would have been lost in the HTML dumps.
(...)
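Roughly, the shape of such a recursive descent parser is sketched below. This is not his parser, just a minimal illustration that handles two constructs ('''bold''' and [[target|label]] links) by recursing on each opening token; real wikitext adds templates, tables, nested lists and many awkward corner cases.

    // Minimal recursive descent sketch for a tiny subset of wikitext:
    // '''bold''' becomes <b>...</b>, [[target|label]] keeps the label.
    using System;
    using System.Text;

    class MiniWikiParser
    {
        private readonly string _src;
        private int _pos;

        public MiniWikiParser(string src) { _src = src; }

        // Parses until end of input or until stopToken is consumed.
        public string Parse(string stopToken = null)
        {
            var sb = new StringBuilder();
            while (_pos < _src.Length)
            {
                if (stopToken != null && LookingAt(stopToken))
                {
                    _pos += stopToken.Length;
                    return sb.ToString();
                }
                if (LookingAt("'''"))
                {
                    _pos += 3;
                    sb.Append("<b>").Append(Parse("'''")).Append("</b>");  // recurse for nested markup
                }
                else if (LookingAt("[["))
                {
                    _pos += 2;
                    string inner = Parse("]]");
                    int pipe = inner.IndexOf('|');
                    sb.Append(pipe >= 0 ? inner.Substring(pipe + 1) : inner);  // keep label if present
                }
                else
                {
                    sb.Append(_src[_pos++]);
                }
            }
            return sb.ToString();   // tolerate unclosed markup at end of input
        }

        private bool LookingAt(string token) =>
            string.CompareOrdinal(_src, _pos, token, 0, token.Length) == 0;

        static void Main()
        {
            var p = new MiniWikiParser("'''Lions''' are [[cat|large cats]].");
            Console.WriteLine(p.Parse());   // <b>Lions</b> are large cats.
        }
    }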
In your case, I'd do the following: install a copy of the
mediawiki software,
http://lifehacker.com/#!163707/geek-to-live--set-up-your-personal-wikipedia
get a list of all the pages in the wiki by running a database
query, and then write a script that makes HTTP requests for all of those pages and saves the results to files. This is programming of the simplest type, but getting good speed could be a challenge. I'd seriously consider using Amazon EC2 for this kind of thing: rent a big DB server and a big web server, then write a script that does the download in parallel.
He could also generate the static HTML dumps from that: http://www.mediawiki.org/wiki/Extension:DumpHTML
I think he is better off parsing the articles, though.
For linguistic research you don't need things such as the contents of templates, so simple wikitext stripping would do. And it will be much, much faster than parsing the whole wiki.
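As an illustration, even something as crude as the sketch below (the regexes are simplistic; nested templates, tables and parser functions are not handled properly) leaves mostly running text:

    // Rough sketch of simple wikitext stripping for linguistic use: drop
    // templates, references and markup, keep the running text.
    using System;
    using System.Text.RegularExpressions;

    static class WikiStripper
    {
        public static string Strip(string wikitext)
        {
            string s = wikitext;
            s = Regex.Replace(s, @"\{\{[^{}]*\}\}", "");                    // {{templates}} (non-nested only)
            s = Regex.Replace(s, @"<ref[^>]*?/>|<ref[^>]*>.*?</ref>", "",   // <ref> footnotes
                              RegexOptions.Singleline);
            s = Regex.Replace(s, @"\[\[[^\]|]*\|([^\]]*)\]\]", "$1");       // [[target|label]] -> label
            s = Regex.Replace(s, @"\[\[([^\]]*)\]\]", "$1");                // [[target]] -> target
            s = Regex.Replace(s, @"'{2,}", "");                             // ''italic'' / '''bold'''
            s = Regex.Replace(s, @"<[^>]+>", "");                           // remaining HTML tags
            s = Regex.Replace(s, @"^[=]+\s*(.*?)\s*[=]+\s*$", "$1",         // == headings ==
                              RegexOptions.Multiline);
            return s;
        }

        static void Main()
        {
            string sample = "{{Infobox}}'''Lions''' are [[cat|large cats]].<ref>Some source</ref>";
            Console.WriteLine(Strip(sample));   // Lions are large cats.
        }
    }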