Hi all,
I am wondering what the fastest/best way is to get a local dump of English Wikipedia in HTML. We are looking for just the current versions of articles (no edit history) for the purposes of a research project.
We have been exploring using bliki [1] to convert the source markup in the Wikipedia dumps to HTML, but the latest version seems to take several seconds per article on average (even after the most common templates have been downloaded and stored locally). At that rate it would take several months to convert the dump.
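For reference, the basic conversion we have been doing with bliki looks roughly like the sketch below (simplified; the dump parsing is omitted, the article text is just a placeholder, and in the real setup the WikiModel is configured to pull template content from our local template store):

import info.bliki.wiki.model.WikiModel;

public class WikitextToHtml {
    public static void main(String[] args) {
        // Placeholder wikitext; in practice this comes from the
        // pages-articles XML dump, one string per article.
        String wikitext = "'''Example''' is a [[placeholder]] article.";

        // Plain wikitext-to-HTML conversion. Template expansion needs a
        // WikiModel set up to resolve template content (we feed it the
        // most common templates stored locally).
        String html = WikiModel.toHtml(wikitext);
        System.out.println(html);
    }
}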
We also considered using Nutch to crawl Wikipedia, but with a reasonable crawl delay (5 seconds) it would likewise take several months to get a copy of every article in HTML (or at least the "reachable" ones): at roughly five million articles, a 5-second delay alone works out to around 290 days of fetching.
Hence we are a bit stuck right now and not sure how to proceed. Any help, pointers or advice would be greatly appreciated!!
Best, Aidan