Short question:
I'm getting inconsistent 403 Forbidden errors when trying to read wikipedia.org content via PHP's file_get_contents().
I believe I'm operating within the limits and terms identified here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
Am I being blocked (inconsistently)? How would I find out?
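For what it's worth, here's a minimal sketch of the kind of request I'm making. This is an example only, not my production code: the URL and the user_agent string are placeholders, and I'm only guessing that the default headers PHP sends (file_get_contents() sends no User-Agent at all unless you set one) might matter here:

<?php
// Example only: fetch a page with a User-Agent set via a stream context.
// The agent string below is a made-up placeholder.
$context = stream_context_create(array(
    'http' => array(
        'method'     => 'GET',
        'user_agent' => 'PSU-LibraryApp/0.1 (http://oz.plymouth.edu/~cbisson/)',
    ),
));

$html = file_get_contents('http://en.wikipedia.org/wiki/Plymouth_State_University', false, $context);

if ($html === false && isset($http_response_header[0])) {
    // $http_response_header holds the raw response headers from the last
    // file_get_contents() call, e.g. "HTTP/1.1 403 Forbidden".
    error_log('Fetch failed: ' . $http_response_header[0]);
}
?>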
Background:
Notwithstanding traditional librarian concerns about the authority of Wikipedia, I'm working on ways to use Wikipedia data in the library context.
Longer question:
My app uses the all_titles_in_ns0 export. If a search matches one of those titles, it fetches that page from en.wikipedia.org, generates a summary, and caches the result (future searches are served from the cache).
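Roughly, the fetch-and-cache step looks like this (just a sketch: the cache path and the summary code are placeholders, and it assumes the search has already matched a title from all_titles_in_ns0):

<?php
// Sketch only: fetch a matched title, derive a summary, and cache it so
// later searches are served locally. Paths and summary logic are placeholders.
function fetch_summary($title) {
    $cache_dir  = '/tmp/wp-cache';
    $cache_file = $cache_dir . '/' . md5($title) . '.txt';

    if (file_exists($cache_file)) {
        return file_get_contents($cache_file);        // served from cache
    }
    if (!is_dir($cache_dir)) {
        mkdir($cache_dir, 0755, true);
    }

    $url  = 'http://en.wikipedia.org/wiki/' . rawurlencode(str_replace(' ', '_', $title));
    $html = file_get_contents($url);
    if ($html === false) {
        return false;                                 // fetch failed (e.g. 403 Forbidden)
    }

    $summary = substr(strip_tags($html), 0, 500);     // stand-in for the real summary step
    file_put_contents($cache_file, $summary);
    return $summary;
}
?>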
In trying to sidestep the 403 Forbidden errors (and eliminate any complaints about scraping wikipedia.org), I've attempted to bring up a private copy of en.wikipedia.org based on MediaWiki and the pages-articles.xml export. This should work, and MWDumper appears to import the contents without error, but the resulting wiki is missing a huge number of pages.
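For reference, the import was done along the lines documented for MWDumper (database and user names here are placeholders):

java -jar mwdumper.jar --format=sql:1.5 pages-articles.xml | mysql -u wikiuser -p wikidb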
So, at the end of the day, my question is: how can I either more reliably fetch pages from wikipedia.org _or_ more reliably create a duplicate wiki that I can scrape?
(Those who are interested can contact me privately for URLs to see this at work. I'd rather not make them public in developmental form.)
Thank you,
Casey Bisson
__________________________________________
e-Learning Application Developer
Plymouth State University
Plymouth, New Hampshire
http://oz.plymouth.edu/~cbisson/
ph: 603-535-2256