I know there's some discussion about "what's appropriate" for the
Wikipedia API, and I'd just like to share my recent experience.
I was trying to download the Wikipedia entries for people, of
which I found about 800,000. I had a scanner already written that
could do the download, so I got started.
After running for about 1 day, I estimated that it would take
about 20 days to bring all of the pages down through the API (running
single-threaded). At that point I gave up, downloaded the data dump (3
hours) and wrote a script to extract the pages -- it then took about an
hour to do the extraction, gzip compressing the text and inserting into a
Don't be intimidated by working with the data dumps. If you've got
an XML API that does streaming processing (I used .NET's XmlReader) and
use the old unix trick of piping the output of bunzip2 into your
program, it's really pretty easy.
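
For anyone who wants to try the same approach, here is a rough sketch of the streaming idea in Python rather than .NET. It is not the script I described above; it uses the standard library's xml.etree.ElementTree.iterparse in place of XmlReader, and it assumes the dump has the usual <page> elements containing <title> and <text>. The names iter_pages and local_name and the file names in the usage line are just placeholders for illustration.

```python
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element, one page at a time."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        # Drop pages that are already processed so memory use stays flat.
        root.clear()


# Pass "-" as the only argument to read the decompressed dump from stdin.
if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    for title, text in iter_pages(sys.stdin.buffer):
        print(title, len(text), sep="\t")
```

Assuming the script is saved as extract_pages.py and the dump as dump.xml.bz2 (both hypothetical names), it would be run as

    bunzip2 -c dump.xml.bz2 | python extract_pages.py -

so the uncompressed XML never has to be written to disk. The print line is where you would put your own filtering, compression or database insert.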