On 3/10/2011 3:46 AM, David Gerard wrote:
> I feel the program will take 71 days to finish all 3.1 million article titles. Is there any way our university IP address could be given permission, or could we send an official email from our department head to the Wikipedia server administrators, to make clear that the program I run from this particular IP address is not an attack? Then we could be allowed to make faster requests, like one every 0.5 seconds, and I could finish my experiment within 35 days. Expecting your positive reply. Regards, Ramesh
I can say, positively, that you'll get the job done faster by downloading the dump file and working on it directly. I've got scripts that can download and extract what I need from the XML dump in an hour or so. I still have some processes that use the API, but I'm increasingly using the dumps because they're faster and easier to work with.
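To make that concrete, here is a minimal sketch (not my scripts, just an illustration) of streaming article titles out of a pages-articles dump with nothing but the Python standard library. The file name enwiki-latest-pages-articles.xml.bz2 is an assumption about which dump you download from dumps.wikimedia.org, and the export namespace changes between dump schema versions, which is why the code just strips it:

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # assumed dump file from dumps.wikimedia.org

def _local(tag):
    """Strip the export namespace, which changes between dump schema versions."""
    return tag.rsplit("}", 1)[-1]

def iter_titles(path):
    """Stream <page> elements out of the bz2-compressed dump and yield their titles."""
    with bz2.open(path, "rb") as f:
        context = ET.iterparse(f, events=("start", "end"))
        _, root = next(context)          # grab the root so we can clear it as we go
        title = None
        for event, elem in context:
            if event != "end":
                continue
            name = _local(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "page":
                if title:
                    yield title
                title = None
                root.clear()             # drop processed pages so memory stays flat

if __name__ == "__main__":
    for i, t in enumerate(iter_titles(DUMP)):
        print(t)
        if i >= 9:                       # just show the first few titles
            break

A single pass like this touches every page in one read of the file, which is why it beats millions of throttled API calls by such a wide margin.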
Note that many facts about Wikipedia topics have already been extracted by DBpedia and Freebase. These are complementary, and if you're interested in getting results, you should use both. DBpedia has some things that aren't in Freebase, such as Wikipedia's link graph and redirects, but Freebase has a type system with 2x better recall for many of the prevalent types.
You might find that DBpedia + Freebase have the information you need. And if they don't, you'll still find them a useful 'guidance control' system for anything you're doing with Wikipedia data.
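As a taste of what DBpedia already gives you, the sketch below asks its public SPARQL endpoint for the redirects pointing at one article, using the SPARQLWrapper library. The dbo:wikiPageRedirects property and the endpoint URL are the ones DBpedia documents, but treat the prefixes and the example resource as assumptions to verify against the release you actually query:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?redirect WHERE {
        ?redirect dbo:wikiPageRedirects dbr:Barack_Obama .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["redirect"]["value"])   # each row is the URI of a redirect page

If a query like that, plus Freebase's type data, answers your question directly, you may not need to crawl Wikipedia at all.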