Short question:
I'm getting inconsistent 403 Forbidden errors when trying to read
wikipedia.org content via PHP's file_get_contents().
I believe I'm operating within the limits and terms identified here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
Am I being blocked (inconsistently)? How would I find out?
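In case it helps with diagnosis: I understand Wikipedia rejects some
requests that send no identifying User-Agent header, and
file_get_contents() sends none by default. Here's a stripped-down
sketch of the fetch with a UA set and the response headers captured
(the UA string is a placeholder, not what the app actually sends):

<?php
// Simplified sketch, not the app's actual code. The User-Agent
// string below is a placeholder.
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: LibraryLookup/0.1 (contact address here)\r\n",
    ),
));

$html = @file_get_contents(
    'http://en.wikipedia.org/wiki/Plymouth_State_University',
    false,
    $context
);

// $http_response_header is populated even when the call fails,
// so the status line of an intermittent 403 can be logged:
if ($html === false && isset($http_response_header)) {
    error_log(implode("\n", $http_response_header));
}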
Background:
Notwithstanding traditional librarian concerns about the authority of
Wikipedia, I'm working on ways to use Wikipedia data in the library
context.
Longer question:
My app uses the all_titles_in_ns0 export. If a search matches one
of those titles, the app fetches that page from en.wikipedia.org to
generate a summary and caches the result (future searches are
answered from the cache).
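The cache step is nothing fancy; it looks roughly like this
(simplified for the list -- the paths, names, and summarizing below
are stand-ins for the real code):

<?php
// Simplified sketch of the fetch-and-cache step. Paths and names
// are placeholders; $context is the stream context shown above,
// and the cache directory is assumed to exist.
function fetch_summary($title, $context) {
    $cache_file = '/tmp/wp-cache/' . md5($title);

    // Future searches are answered from the cache.
    if (file_exists($cache_file)) {
        return file_get_contents($cache_file);
    }

    // Wikipedia titles use underscores in place of spaces.
    $url = 'http://en.wikipedia.org/wiki/'
         . rawurlencode(str_replace(' ', '_', $title));

    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        return false; // the intermittent 403s surface here
    }

    // Crude stand-in for the real summary generation.
    $summary = substr(strip_tags($html), 0, 500);
    file_put_contents($cache_file, $summary);
    return $summary;
}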
In trying to sidestep the 403 Forbidden errors (and eliminate any
complaints about scraping wikipedia.org), I've attempted to bring up
a private copy of en.wikipedia.org based on MediaWiki and the
pages-articles.xml export. This should work, as MWDumper seems to
import the contents okay, but the resulting wiki is missing a huge
number of pages.
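(One thing I haven't ruled out: my understanding is that MWDumper
only populates the page, revision, and text tables, and that the
link tables and search index have to be rebuilt afterward by running
maintenance/rebuildall.php from the MediaWiki root. If skipping that
step would explain pages appearing to be missing, I'd welcome a
pointer.)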
So, at the end of the day, my question is: how can I either more
reliably fetch pages from wikipedia.org _or_ more reliably create a
duplicate wiki that I can scrape?
(Those who are interested can contact me privately for URLs to see
this at work. I'd rather not make them public in developmental form.)
Thank you,
Casey Bisson
__________________________________________
e-Learning Application Developer
Plymouth State University
Plymouth, New Hampshire
http://oz.plymouth.edu/~cbisson/
ph: 603-535-2256