On 4/5/2011 9:38 AM, Samat wrote:
Hi Huib,
Thank you for your prompt answer.
I know about the XML dumps, but I am looking for HTML dumps for linguistic (and related) research. The research needs the full text of Wikipedia articles without any other content. The researcher asked me to help find a solution that avoids writing a program to generate this from the XML. (I am not a programmer, but there could be problems with included pages, templates, and other special content.)
Best regards, Samat
I did a substantial project that worked from the XML dumps. I wrote a recursive descent parser in C# that, with a few tricks, decodes Wikipedia markup almost correctly. Getting it right is tricky for a number of reasons; however, my approach preserved some semantics that would have been lost in the HTML dumps.
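To give a feel for what that involves, here is a toy sketch of the recursive descent idea for nested {{...}} templates. It's Python rather than the C# I actually used, it is not my parser, and it glosses over all the hard cases ({{{parameters}}}, tables, <nowiki>, unbalanced markup in the wild, and so on):

# Toy sketch only: find templates and their nesting in a chunk of wikitext.
def parse_template(text, pos):
    """Parse one '{{...}}' starting at pos (which must point at '{{').
    Returns (template_source, children, end_pos), where end_pos is just
    past the closing '}}'."""
    start = pos
    pos += 2                                   # skip the opening '{{'
    children = []
    while pos < len(text):
        if text.startswith("{{", pos):         # nested template: recurse
            child, grandkids, pos = parse_template(text, pos)
            children.append((child, grandkids))
        elif text.startswith("}}", pos):       # end of this template
            pos += 2
            return text[start:pos], children, pos
        else:
            pos += 1
    raise ValueError("unbalanced '{{' at offset %d" % start)

def find_templates(text):
    """Return the source of every top-level '{{...}}' template in text."""
    results, pos = [], 0
    while pos < len(text):
        if text.startswith("{{", pos):
            source, _children, pos = parse_template(text, pos)
            results.append(source)
        else:
            pos += 1
    return results

# Example: find_templates("foo {{cite|{{nested}}}} bar") -> ['{{cite|{{nested}}}}']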
If I had to do it all again, I'd probably download a copy of the mediawiki software, load Wikipedia into it, and then add some 'hooks' into the template handling code that would let me put traps on important templates and other parts of the markup handling so template nesting gets handled correctly.
Old static dumps, going back to June 2008, are available:
In your case, I'd do the following: install a copy of the mediawiki software,
http://lifehacker.com/#!163707/geek-to-live--set-up-your-personal-wikipedia
get a list of all the pages in the wiki by running a database query, and then write a script that makes HTTP requests for all the pages and saves them to files. This is programming of the simplest type, but getting good speed could be a challenge. I'd seriously consider using Amazon EC2 for this kind of thing: rent a big database server and a big web server, then write a script that does the downloads in parallel. A rough sketch of that script follows below.
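Purely as an illustration, the download half might look something like the Python below. Everything specific in it is an assumption about your setup: the base URL of the local install, a page_titles.txt file with one title per line (which you could produce with a query along the lines of SELECT page_title FROM page WHERE page_namespace = 0), the output directory, and the number of worker threads. MediaWiki's index.php?action=render returns just the rendered article body, without the skin, which sounds like what the researcher wants.

import os
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost/wiki/index.php"   # assumed location of the local install
OUT_DIR = "html_dump"                          # where the HTML files go
WORKERS = 8                                    # parallel downloads

def fetch_page(title):
    """Fetch the rendered HTML body of one page and write it to a file."""
    url = BASE_URL + "?" + urllib.parse.urlencode({"title": title, "action": "render"})
    with urllib.request.urlopen(url) as resp:
        html = resp.read()
    fname = urllib.parse.quote(title, safe="") + ".html"   # make the title filename-safe
    with open(os.path.join(OUT_DIR, fname), "wb") as f:
        f.write(html)
    return title

def main():
    os.makedirs(OUT_DIR, exist_ok=True)
    with open("page_titles.txt", encoding="utf-8") as f:   # one page title per line
        titles = [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for title in pool.map(fetch_page, titles):
            print("saved", title)

if __name__ == "__main__":
    main()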