Paul Houle wrote:
I did a substantial project that worked from the XML dumps. I
wrote a recursive descent parser in C# that, with a few tricks,
decodes Wikipedia markup almost correctly. Getting it right is tricky
for a number of reasons; however, my approach preserved some
semantics that would have been lost in the HTML dumps.
(...)
In your case, I'd do the following: install a copy of the MediaWiki software,
http://lifehacker.com/#!163707/geek-to-live--set-up-your-personal-wikipedia
get a list of all the pages in the wiki by running a database
query, and then write a script that makes HTTP requests for all the
pages and saves them to files. This is programming of the simplest
type, but getting good speed could be a challenge. I'd seriously
consider using Amazon EC2 for this kind of thing: rent a big DB
server and a big web server, then write a script that does the
download in parallel.
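The download step he describes could be sketched roughly as below. This is just an illustration, not his script: the base URL, the titles.txt input file, and the pages/ output directory are all assumptions you'd adjust for your own install.

```python
# Sketch: fetch every page of a local MediaWiki install in parallel and
# save the wikitext to files. BASE_URL, titles.txt, and pages/ are
# illustrative assumptions, not part of the original post.
import os
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost/wiki/index.php"  # assumed local install

def page_url(title):
    # action=raw asks MediaWiki for the page's wikitext, not rendered HTML
    query = urllib.parse.urlencode({"title": title, "action": "raw"})
    return BASE_URL + "?" + query

def fetch_page(title):
    with urllib.request.urlopen(page_url(title)) as resp:
        text = resp.read().decode("utf-8")
    safe = title.replace("/", "_")  # crude filename sanitizing
    with open(os.path.join("pages", safe + ".txt"), "w",
              encoding="utf-8") as f:
        f.write(text)

if __name__ == "__main__":
    os.makedirs("pages", exist_ok=True)
    with open("titles.txt", encoding="utf-8") as f:  # one title per line
        titles = [line.strip() for line in f if line.strip()]
    # a thread pool gives the parallel download the post suggests
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(fetch_page, titles))
```

The title list itself would come from the database query he mentions (e.g. selecting from MediaWiki's page table).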
He could also generate the static HTML dumps from that:
http://www.mediawiki.org/wiki/Extension:DumpHTML
I think he is better off parsing the articles, though.
For linguistic research you don't need things such as the contents of
templates, so simple wikitext stripping would do. And it will be much,
much faster than parsing the whole wiki.