---------- Forwarded message ----------
From: ramesh kumar <ramesh_chill@hotmail.com>
Date: 9 March 2011 13:27
Subject: RE: Reg. Research using Wikipedia
To: dgerard@gmail.com
Dear Mr. Gerard,

Thank you for your quick response. Is there a minimum time gap I should leave between one request and the next, for example 1 second or 1 millisecond? If so, I can add a sleep to my program.

At the same time, I have 3.1 million (3,101,144) wiki article titles. If each title takes roughly 2 seconds (about 1 second for the request plus a 1-second sleep before the next one), then one day allows 60 (sec) x 60 (min) x 24 (hr) = 86,400 seconds / 2 = 43,200 requests, and 3,101,144 / 43,200 is about 72. So I estimate the program would take around 72 days to finish all 3.1 million article titles.

Is there any way our university IP address could be given permission, or an official email sent from our department head to the Wikipedia server administrators, confirming that the program run from this particular IP address is not an attack? Then we could make faster requests, for example one every 0.5 seconds, and I could finish my experiment within about 35 days.

Expecting your positive reply.

Regards,
Ramesh
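For illustration, a sequential, rate-limited fetch of the kind described above might look roughly like this in Python (a sketch only: the 1-second delay, the example titles and the User-Agent string are placeholders, not values prescribed by Wikipedia):

    import json
    import time
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"
    DELAY_SECONDS = 1.0  # illustrative pause between requests

    def fetch_categories(title):
        """Fetch the categories of one article via the MediaWiki API."""
        params = urllib.parse.urlencode({
            "action": "query",
            "prop": "categories",
            "titles": title,
            "cllimit": "max",
            "format": "json",
        })
        req = urllib.request.Request(
            API + "?" + params,
            # A descriptive User-Agent is good API etiquette; this one is a placeholder.
            headers={"User-Agent": "CategoryResearchBot/0.1 (contact: your-email@example.org)"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    for title in ["Malaysia", "Blog"]:      # stand-in for the 3.1 million titles
        data = fetch_categories(title)      # wait for this request to finish...
        time.sleep(DELAY_SECONDS)           # ...then pause before the next one

The key point, as the sysadmins note below, is that the requests are strictly sequential: each one finishes before the next starts.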
----------------------------------------------------------------------
Date: Wed, 9 Mar 2011 10:39:43 +0000
Subject: Re: Reg. Research using Wikipedia
From: dgerard@gmail.com
To: ramesh_chill@hotmail.com
I asked the wikitech-l list, which is where the system administrators talk, and they said:
"If they use the API and wait for one request to finish before they start the next one (i.e. don't make parallel requests), that's pretty much always fine."
http://lists.wikimedia.org/pipermail/wikitech-l/2011-March/052137.html
Hopefully this will put your network administrators' minds at rest :-)
- d.
On 9 March 2011 09:47, ramesh kumar <ramesh_chill@hotmail.com> wrote:
Dear Members,

I am Ramesh, pursuing my PhD at Monash University, Malaysia. My research is on blog classification using Wikipedia categories. For my experiment I use the 12 main categories of Wikipedia, and I want to identify which of the 12 main categories each particular article belongs to. So I wrote a program that collects the subcategories of each article and classifies it against the 12 categories offline. I have already downloaded the wiki dump, which contains around 3 million article titles. My program takes these 3 million article titles, goes to the online Wikipedia website and fetches the subcategories.

Our university network administrators are worried that Wikipedia would treat this as a DDoS attack and could block our IP address once my program runs. I searched all over for how to get permission from Wikipedia, and found that the wikien-l members might be able to help. Could you please suggest whom to contact and what the procedure is to get approval for our IP address to do this, or offer any other suggestions?

Eagerly waiting for a positive reply.

Thanks and regards,
Ramesh
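For illustration, the classification step described here (walking from an article's categories up towards the 12 main categories) might be sketched like this; the example main-category names, the depth limit and the get_parent_categories callback are assumptions, not details of the actual program:

    from collections import deque

    # Illustrative stand-ins; the real list has 12 main categories.
    MAIN_CATEGORIES = {"Category:Science", "Category:Technology"}
    MAX_DEPTH = 5  # stop after a few levels of parent categories

    def classify(title, get_parent_categories):
        """Return the main categories reachable from an article's categories.

        get_parent_categories(page) should return the categories of a page,
        whether via the API or via an offline lookup built from the dump.
        """
        found = set()
        seen = set()
        queue = deque((cat, 1) for cat in get_parent_categories(title))
        while queue:
            category, depth = queue.popleft()
            if category in seen or depth > MAX_DEPTH:
                continue
            seen.add(category)
            if category in MAIN_CATEGORIES:
                found.add(category)
            else:
                queue.extend((parent, depth + 1)
                             for parent in get_parent_categories(category))
        return found

A visited set and a depth cut-off matter here: the category graph is not a tree, so a naive upward walk can loop or wander far from the original topic.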
On 3/10/2011 3:46 AM, David Gerard wrote:
So I estimate the program would take around 72 days to finish all 3.1 million article titles. Is there any way our university IP address could be given permission, or an official email sent from our department head to the Wikipedia server administrators, confirming that the program run from this particular IP address is not an attack? Then we could make faster requests, for example one every 0.5 seconds, and I could finish my experiment within about 35 days.
I can say, positively, that you'll get the job done faster by downloading the dump file and cracking into it directly. I've got scripts that can download and extract stuff from the XML dump in an hour or so. I still have some processes that use the API, but I'm increasingly using the dumps because they're faster and easier.
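For example, a streaming pass over a compressed pages-articles dump might look roughly like this (a sketch only: the dump filename is a placeholder, and the '{*}' namespace wildcard in the paths needs Python 3.8 or later):

    import bz2
    import re
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.bz2"  # placeholder path
    CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)", re.IGNORECASE)

    def iter_article_categories(path):
        """Yield (title, [category names]) for each page in the dump,
        reading the bz2 file as a stream rather than uncompressing it first."""
        with bz2.open(path, "rb") as f:
            for _, elem in ET.iterparse(f):
                if elem.tag.rsplit("}", 1)[-1] != "page":
                    continue
                title = elem.findtext("{*}title")
                text = elem.findtext("{*}revision/{*}text") or ""
                yield title, CATEGORY_LINK.findall(text)
                elem.clear()  # free memory as the parse moves on

A full English dump still takes a while to chew through, but it's one local pass instead of millions of network round-trips.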
Note that many facts about Wikipedia topics have already been extracted by DBpedia and Freebase. These are complementary, and if you're interested in getting results, you should use both. DBpedia has some things that aren't in Freebase, such as Wikipedia's link graph and redirects, but Freebase has a type system with roughly twice the recall for many of the prevalent types.
You might find that DBpedia + Freebase has the information you need. And if it doesn't, you'll still find it's a useful 'guidance control' system for anything you're doing with Wikipedia data.
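For example, an article's categories can be pulled from DBpedia's public SPARQL endpoint roughly like this (a sketch only: the example article is arbitrary, and walking further up the category tree would use skos:broader on the returned category resources):

    import json
    import urllib.parse
    import urllib.request

    ENDPOINT = "https://dbpedia.org/sparql"
    QUERY = """
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?category WHERE {
      <http://dbpedia.org/resource/Malaysia> dct:subject ?category .
    }
    """

    params = urllib.parse.urlencode({
        "query": QUERY,
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(ENDPOINT + "?" + params) as resp:
        results = json.load(resp)

    for row in results["results"]["bindings"]:
        print(row["category"]["value"])  # e.g. a dbpedia.org/resource/Category:... URI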
On 3/10/11 6:29 AM, Paul Houle wrote:
I can say, positively, that you'll get the job done faster by downloading the dump file and cracking into it directly. I've got scripts that can download and extract stuff from the XML dump in an hour or so. I still have some processes that use the API, but I'm increasingly using the dumps because they're faster and easier.
You're likely correct. Also, I've recently come across the 'wikipedia offline patch' extension (http://code.google.com/p/wikipedia-offline-patch/), which I believe allows you to use a compressed dump as your database storage, saving you the pain (and disk space) of uncompressing the dump file. Probably worth a look.
Arthur
On 10/03/11 7:13 PM, Arthur Richards wrote:
You're likely correct. Also, I've recently come across the 'wikipedia offline patch' extension (http://code.google.com/p/wikipedia-offline-patch/), which I believe allows you to use a compressed dump as your database storage, saving you the pain (and disk space) of uncompressing the dump file. Probably worth a look.
Arthur
Having reinvented the same wheel myself years ago, I would be very surprised if the couple of branches used in its SELECT were enough to serve all queries. And no, the ttsiod approach won't be able to reach all articles.