API vs data dumps - Wikitech-l

13 Oct 2010

     I know there's some discussion about "what's appropriate" for the 
Wikipedia API, and I'd just like to share my recent experience.

     I was trying to download the Wikipedia entries for people,  of 
which I found about 800,000.   I had a scanner already written that 
could do the download,  so I got started.

     After running for about a day, I estimated that it would take 
about 20 days to bring all of the pages down through the API (running 
single-threaded). At that point I gave up, downloaded the data dump (3 
hours) and wrote a script to extract the pages -- it then took about an 
hour to do the extraction, gzip-compressing the text and inserting it 
into a MySQL database.
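
To put those numbers side by side, here is a quick back-of-the-envelope calculation in Python. It only uses the rough figures quoted above (800,000 pages, about 20 days via the API, 3 hours of download plus about 1 hour of extraction), so treat the results as estimates: roughly 2.16 seconds per page through the API, and a dump route that comes out around 120 times faster overall.

```python
# Rough figures from the post; all of them are estimates.
PAGES = 800_000            # people pages to fetch
API_DAYS = 20              # projected single-threaded API run
DUMP_DOWNLOAD_HOURS = 3    # time to download the data dump
EXTRACT_HOURS = 1          # time to extract, compress and load

api_hours = API_DAYS * 24
seconds_per_page = api_hours * 3600 / PAGES   # API cost per page
pages_per_day = PAGES / API_DAYS              # API throughput
dump_hours = DUMP_DOWNLOAD_HOURS + EXTRACT_HOURS
speedup = api_hours / dump_hours              # dump route vs. API route

print(f"API: {seconds_per_page:.2f} s/page, {pages_per_day:,.0f} pages/day")
print(f"Dump: {dump_hours} h total, about {speedup:.0f}x faster")
```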

     Don't be intimidated by working with the data dumps.  If you've got 
an XML API that does streaming processing (I used .NET's XmlReader) and 
you use the old Unix trick of piping the output of bunzip2 into your 
program, it's really pretty easy.
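
For anyone who wants to try the same approach without .NET, here is a minimal sketch in Python using the standard library's streaming parser (xml.etree.ElementTree.iterparse). It is not the script I described above, just an illustration of the technique. The element names (page, title, text) are my assumption about the dump's export format, and iter_pages and local_name are made-up helper names.

```python
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page>, without loading the whole dump.

    Assumes each <page> holds a <title> and a <text> element somewhere
    inside it (an assumption about the export format).
    """
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)          # first event is the start of the root
    title = text = None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title = text = None
            root.clear()             # drop finished pages to keep memory flat


def count_pages(stream):
    """Example consumer: count the pages seen on the stream."""
    return sum(1 for _ in iter_pages(stream))
```

To use it with the pipe trick, save it as (say) extract.py, add a last line such as print(count_pages(sys.stdin.buffer)), and run:

    bunzip2 -c dump.xml.bz2 | python extract.py

where dump.xml.bz2 stands for whichever dump file you downloaded. Reading from sys.stdin.buffer gives the parser raw bytes, so it can pick up the encoding from the XML declaration itself.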