Paul Houle wrote:
I did a substantial project that worked from the XML dumps. I
wrote a recursive descent parser in C# that, with a few tricks,
decodes Wikipedia markup almost correctly. Getting it right is tricky
for a number of reasons; however, my approach preserved some
semantics that would have been lost in the HTML dumps.
(...)
In your case, I'd do the following: install a copy of the MediaWiki software,
http://lifehacker.com/#!163707/geek-to-live--set-up-your-personal-wikipedia
get a list of all the pages in the wiki by running a database
query, and then write a script that makes HTTP requests for all the
pages and saves them to files. This is programming of the simplest
type, but getting good speed could be a challenge. I'd seriously
consider using Amazon EC2 for this kind of thing: rent a big DB
server and a big web server, then write a script that does the
download in parallel.
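The download step he describes could be sketched roughly as below. This is just an illustration, not his script: the base URL, the titles.txt input file, and the pages/ output directory are all assumptions you'd adjust for your own install.

```python
# Sketch: fetch every page of a local MediaWiki install in parallel and
# save the wikitext to files. BASE_URL, titles.txt, and pages/ are
# illustrative assumptions, not part of the original post.
import os
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost/wiki/index.php"  # assumed local install

def page_url(title):
    # action=raw asks MediaWiki for the page's wikitext, not rendered HTML
    query = urllib.parse.urlencode({"title": title, "action": "raw"})
    return BASE_URL + "?" + query

def fetch_page(title):
    with urllib.request.urlopen(page_url(title)) as resp:
        text = resp.read().decode("utf-8")
    safe = title.replace("/", "_")  # crude filename sanitizing
    with open(os.path.join("pages", safe + ".txt"), "w",
              encoding="utf-8") as f:
        f.write(text)

if __name__ == "__main__":
    os.makedirs("pages", exist_ok=True)
    with open("titles.txt", encoding="utf-8") as f:  # one title per line
        titles = [line.strip() for line in f if line.strip()]
    # a thread pool gives the parallel download the post suggests
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(fetch_page, titles))
```

The title list itself would come from the database query he mentions (e.g. selecting from MediaWiki's page table).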
He could also generate the static HTML dumps from that:
http://www.mediawiki.org/wiki/Extension:DumpHTML
I think he is better off parsing the articles, though.
For linguistic research you don't need things such as the contents of
templates, so simple wikitext stripping would do. And it will be much,
much faster than parsing the whole wiki.