Is there a standard answer to this question: how hard are researchers allowed to hammer the site?
- d.
---------- Forwarded message ----------
From: ramesh kumar ramesh_chill@hotmail.com
Date: 9 March 2011 09:47
Subject: Reg. Research using Wikipedia
To: wikien-l@lists.wikimedia.org, wikien-l-owner@lists.wikimedia.org
Dear Members,

I am Ramesh, pursuing my PhD at Monash University, Malaysia. My research is on blog classification using Wikipedia categories. For my experiment, I use the 12 main categories of Wikipedia, and I want to identify which of those 12 main categories each article belongs to. I wrote a program that collects the subcategories of each article and classifies it against the 12 categories offline. I have already downloaded a wiki dump containing around 3 million article titles. My program takes these 3 million article titles, goes to the live Wikipedia website, and fetches the subcategories.

Our university network administrators are worried that Wikipedia would treat this as a DDoS attack and could block our IP address if my program runs. I searched all over for a way to get permission from Wikipedia and found that the wikien-l members may be able to help. Could you please suggest whom to contact and what the procedure is to get approval for our IP address, or offer other suggestions?

Eagerly waiting for a positive reply.

Thanks and regards,
Ramesh
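For illustration, the offline classification step Ramesh describes might look something like the Python sketch below -- hypothetical throughout: the parents mapping (category to parent categories, built from the dump) and the MAIN set of 12 main categories are assumptions, not taken from the thread.

    # Rough sketch of the offline classification described above.
    # Hypothetical: `parents` maps each category to its parent
    # categories (built from the dump); MAIN holds the 12 main
    # categories. Neither comes from the thread.
    from collections import deque

    MAIN = {"Category:Arts", "Category:Science"}  # ...the 12 main categories

    def classify(article_categories, parents, max_depth=10):
        """Walk upward from an article's categories, breadth-first,
        and return which main categories are reachable."""
        hits = set()
        seen = set(article_categories)
        queue = deque((c, 0) for c in article_categories)
        while queue:
            cat, depth = queue.popleft()
            if cat in MAIN:
                hits.add(cat)
                continue
            if depth < max_depth:
                for parent in parents.get(cat, ()):
                    if parent not in seen:
                        seen.add(parent)
                        queue.append((parent, depth + 1))
        return hits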
2011/3/9 David Gerard dgerard@gmail.com:
Is there a standard answer to this question: how hard are researchers allowed to hammer the site?
If they use the API and wait for one request to finish before they start the next one (i.e. don't make parallel requests), that's pretty much always fine.
Roan Kattouw (Catrope)
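For illustration, serial requesting might look like the following Python sketch (standard library only; the prop=categories query is the ordinary MediaWiki API call, while the User-Agent string and sample titles are placeholders):

    # One request at a time against the MediaWiki API: each call
    # finishes before the next begins, so there is no parallel load.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def categories_of(title):
        params = urllib.parse.urlencode({
            "action": "query",
            "prop": "categories",
            "titles": title,
            "cllimit": "max",
            "format": "json",
        })
        req = urllib.request.Request(
            API + "?" + params,
            headers={"User-Agent": "category-research/0.1 (contact email)"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        return [c["title"] for c in page.get("categories", [])]

    for title in ("Blog", "Monash University"):
        print(title, categories_of(title))  # strictly sequential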
Dear Members, I am Ramesh, pursuing my PhD at Monash University, Malaysia. My research is on blog classification using Wikipedia categories. For my experiment, I use the 12 main categories of Wikipedia, and I want to identify which of those 12 main categories each article belongs to. I wrote a program that collects the subcategories of each article and classifies it against the 12 categories offline. I have already downloaded a wiki dump containing around 3 million article titles. My program takes these 3 million article titles, goes to the live Wikipedia website, and fetches the subcategories.
Why do you need to access the live Wikipedia for this? Using categorylinks.sql and page.sql, you should be able to fetch the same data, probably faster.
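For illustration, once page.sql and categorylinks.sql are imported into a local MySQL database, the lookup is a single join. A Python sketch assuming mysql-connector-python and a local database named enwiki (both assumptions, not from the thread):

    # Fetch each article's categories from the imported dumps instead
    # of the live site. Assumes the page.sql and categorylinks.sql
    # dumps were loaded into a local MySQL database called "enwiki";
    # the credentials below are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(user="research", password="secret",
                                   host="localhost", database="enwiki")
    cur = conn.cursor()
    cur.execute("""
        SELECT p.page_title, cl.cl_to
        FROM page AS p
        JOIN categorylinks AS cl ON cl.cl_from = p.page_id
        WHERE p.page_namespace = 0   -- namespace 0 = articles
        LIMIT 100                    -- drop for the full 3 million
    """)
    for title, category in cur:
        print(title, category)
    conn.close()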
Why do you need to access the live Wikipedia for this? Using categorylinks.sql and page.sql, you should be able to fetch the same data, probably faster.
In my research, the answer to this question is twofold:

A) Creating a local copy of Wikipedia (using MediaWiki and various import tools) is quite a process, and requires a significant investment of time and research in itself.

B) A few months ago, I pulled 333 semi-random articles from the live API -- of those, 329 had significant changes since the 20100312 dump (the newest dump at the time). A new check against the 20110115 dump shows a similar percentage.

Caveat -- my research is largely centered on infobox template usage, which is relatively new, so those articles are being updated frequently.
-- James
James Linden wrote:
Why do you need to access the live Wikipedia for this? Using categorylinks.sql and page.sql, you should be able to fetch the same data, probably faster.

In my research, the answer to this question is twofold:

A) Creating a local copy of Wikipedia (using MediaWiki and various import tools) is quite a process, and requires a significant investment of time and research in itself.

You don't need to do a full copy to, e.g., fetch infoboxes.

B) A few months ago, I pulled 333 semi-random articles from the live API -- of those, 329 had significant changes since the 20100312 dump (the newest dump at the time). A new check against the 20110115 dump shows a similar percentage.

Getting updated data may be a reason, but I don't think that's what Ramesh wanted. Plus, you wanted 333 articles, not all 3 million...
On 3/9/2011 11:29 AM, James Linden wrote:
Why do you need to access the live Wikipedia for this? Using categorylinks.sql and page.sql, you should be able to fetch the same data, probably faster.

In my research, the answer to this question is twofold:

A) Creating a local copy of Wikipedia (using MediaWiki and various import tools) is quite a process, and requires a significant investment of time and research in itself.
You don't need a local copy of MediaWiki or any special tools to use the SQL dumps, just MySQL.
On 9 March 2011 16:00, Platonides Platonides@gmail.com wrote:
Dear Members, I am Ramesh, pursuing my PhD at Monash University, Malaysia. My research is on blog classification using Wikipedia categories. For my experiment, I use the 12 main categories of Wikipedia, and I want to identify which of those 12 main categories each article belongs to. I wrote a program that collects the subcategories of each article and classifies it against the 12 categories offline. I have already downloaded a wiki dump containing around 3 million article titles. My program takes these 3 million article titles, goes to the live Wikipedia website, and fetches the subcategories.

Why do you need to access the live Wikipedia for this? Using categorylinks.sql and page.sql, you should be able to fetch the same data, probably faster.
I concur. Everything required for this project should be in the dumps.