Re: [Wikitech-l] Acceptable use of API

24 Sep 2010

  On 9/24/2010 8:49 AM, Robin Ryder wrote:
...
  Hi,

 Thanks for the quick answers, and for the useful link.

 My previous e-mail was not detailed enough; sorry about that. Let me
 clarify:
 - I don't need to crawl the entire Wikipedia, only (for example) articles in
 a category. ~1,000 articles would be a good start, and I definitely won't be
 going above ~40,000 articles.
 - For every article in the data set, I need to follow every interlanguage
 link, and get the article creation date (i.e. creation date of [[en:Brad
 Pitt]], [[fr:Brad Pitt]], [[it:Brad Pitt]], etc). As far as I can tell, this
 means that I need one query for every language link.

 The data are reasonably easy to get through the API. If my queries risk
 overloading the server, I am obviously happy to go through the toolserver
 (once my account gets approved!).

The first part is easy to do if accuracy doesn't matter. Precision and
recall are often around 50% for categories in Wikipedia, so if you
really care about being right you have to construct your own
categories, and it helps to have a synoptic view. Often you can get
that view from Freebase and DBpedia, but I'm increasingly coming around
to indexing Wikipedia directly because, for the things I care about, I
can do better than DBpedia. Freebase does add some special value
because they do gardening, data cleaning, data mining, hand edits and
other things that clean up the mess.

Secondly, it's not hard at all to run, say, 200k requests against the
API over the span of a few days. I think you could get your creation
dates from the history records.
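
As a rough sketch of what "creation dates from the history records"
could look like against the MediaWiki API: ask for a page's revisions
oldest-first and take only the first one, and use prop=langlinks to
find the same article in other languages. The helper names, the
User-Agent string and the contact address below are made up for
illustration; the parsing assumes the default JSON response layout.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Hypothetical identification string; replace with your own contact details.
USER_AGENT = "creation-date-sketch/0.1 (contact: you@example.org)"


def api_url(lang, **params):
    """Build a query URL for the given language edition of Wikipedia."""
    query = {"action": "query", "format": "json", **params}
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(query)


def creation_date_url(lang, title):
    # rvdir=newer lists revisions oldest first, so rvlimit=1 returns
    # only the first revision, i.e. the one that created the page.
    return api_url(lang, prop="revisions", titles=title,
                   rvlimit=1, rvdir="newer", rvprop="timestamp")


def langlinks_url(lang, title):
    # lllimit=max asks for as many interlanguage links as one request allows.
    return api_url(lang, prop="langlinks", titles=title, lllimit="max")


def first_page(response):
    """Return the single page record from a decoded one-title response."""
    return next(iter(response["query"]["pages"].values()))


def parse_creation_date(response):
    """Timestamp of the first revision, or None if the page is missing."""
    revisions = first_page(response).get("revisions")
    return revisions[0]["timestamp"] if revisions else None


def parse_langlinks(response):
    """List of (language code, title) pairs from a langlinks response."""
    links = first_page(response).get("langlinks", [])
    return [(link["lang"], link["*"]) for link in links]


def fetch_json(url):
    """Fetch one API URL and decode the JSON body (performs a network call)."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

For each article you would call fetch_json(langlinks_url("en", title)),
then fetch_json(creation_date_url(lang, other_title)) for every pair
returned by parse_langlinks, sleeping between calls. If "a few days"
means three, 200k requests works out to under one request per second.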
