I know there's some discussion about "what's appropriate" for the Wikipedia API, and I'd just like to share my recent experience.
I was trying to download the Wikipedia entries for people, of which I found about 800,000. I had a scanner already written that could do the download, so I got started.
After running for about 1 day, I estimated that it would take about 20 days to bring all of the pages down through the API (running single-threaded). At that point I gave up, downloaded the data dump (3 hours) and wrote a script to extract the pages -- it then took about an hour to do the extraction, gzip-compressing the text and inserting it into a MySQL database.
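The storage side is nothing special, by the way: run the text through a gzip stream and do a parameterized INSERT. A rough sketch of that part (assuming the MySqlConnector package and a made-up pages(title, body) table -- not exactly what I ran):

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;
using MySqlConnector; // assumption: the MySqlConnector NuGet package

static class PageStore
{
    // Gzip the wikitext before storing it; article text compresses well.
    public static byte[] GzipText(string text)
    {
        using var buffer = new MemoryStream();
        using (var gz = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
        {
            byte[] bytes = Encoding.UTF8.GetBytes(text);
            gz.Write(bytes, 0, bytes.Length);
        }
        return buffer.ToArray();
    }

    // Made-up schema: pages(title VARCHAR(255), body MEDIUMBLOB).
    public static void InsertPage(MySqlConnection conn, string title, string text)
    {
        using var cmd = new MySqlCommand(
            "INSERT INTO pages (title, body) VALUES (@title, @body)", conn);
        cmd.Parameters.AddWithValue("@title", title);
        cmd.Parameters.AddWithValue("@body", GzipText(text));
        cmd.ExecuteNonQuery();
    }
}
```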
Don't be intimidated by working with the data dumps. If you've got an XML API that does streaming processing (I used .NET's XmlReader) and you use the old Unix trick of piping the output of bunzip2 into your program, it's really pretty easy.
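To give an idea of how little code that takes, here's a minimal sketch of the streaming loop reading the decompressed dump from stdin (not my actual script; the dump filename is just an example and the Console.WriteLine is where the real per-page work would go):

```csharp
using System;
using System.Xml;

class DumpExtract
{
    static void Main()
    {
        // Feed it the decompressed dump on stdin, e.g.
        //   bunzip2 -c enwiki-latest-pages-articles.xml.bz2 | DumpExtract
        using var reader = XmlReader.Create(Console.OpenStandardInput());

        string title = null;
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "title")
            {
                // ReadElementContentAsString() consumes the whole element and
                // leaves the reader positioned after </title>, so don't Read() again.
                title = reader.ReadElementContentAsString();
            }
            else if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "text")
            {
                string wikitext = reader.ReadElementContentAsString();
                // This is where you'd filter for the pages you care about and
                // hand title/wikitext off to whatever does the storing.
                Console.WriteLine($"{title}\t{wikitext.Length} chars");
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

Since the reader never loads more than one element's content at a time, memory stays flat no matter how big the dump is.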