Short question:
I'm getting inconsistent 403 Forbidden errors when trying to read
wikipedia.org content via PHP's file_get_contents().
I believe I'm operating within the limits and terms identified here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
Am I being blocked (inconsistently)? How would I find out?
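In case it helps with diagnosis: I understand Wikipedia rejects some
requests that send no identifying User-Agent header, and
file_get_contents() sends none by default. Here's a stripped-down
sketch of the fetch with a UA set and the response headers captured
(the UA string is a placeholder, not what the app actually sends):

<?php
// Simplified sketch, not the app's actual code. The User-Agent
// string below is a placeholder.
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: LibraryLookup/0.1 (contact address here)\r\n",
    ),
));

$html = @file_get_contents(
    'http://en.wikipedia.org/wiki/Plymouth_State_University',
    false,
    $context
);

// $http_response_header is populated even when the call fails,
// so the status line of an intermittent 403 can be logged:
if ($html === false && isset($http_response_header)) {
    error_log(implode("\n", $http_response_header));
}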
Background:
Notwithstanding traditional librarian concerns about the authority of
Wikipedia, I'm working on ways to use Wikipedia data in the library
context.
Longer question:
My app uses the all_titles_in_ns0 export. If a search matches one
of those titles, the app fetches that page from en.wikipedia.org to
generate a summary and caches the result (future searches are
answered from the cache).
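The cache step is nothing fancy; it looks roughly like this
(simplified for the list -- the paths, names, and summarizing below
are stand-ins for the real code):

<?php
// Simplified sketch of the fetch-and-cache step. Paths and names
// are placeholders; $context is the stream context shown above,
// and the cache directory is assumed to exist.
function fetch_summary($title, $context) {
    $cache_file = '/tmp/wp-cache/' . md5($title);

    // Future searches are answered from the cache.
    if (file_exists($cache_file)) {
        return file_get_contents($cache_file);
    }

    // Wikipedia titles use underscores in place of spaces.
    $url = 'http://en.wikipedia.org/wiki/'
         . rawurlencode(str_replace(' ', '_', $title));

    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        return false; // the intermittent 403s surface here
    }

    // Crude stand-in for the real summary generation.
    $summary = substr(strip_tags($html), 0, 500);
    file_put_contents($cache_file, $summary);
    return $summary;
}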
In trying to sidestep the 403 Forbidden errors (and eliminate any
complaints about scraping wikipedia.org), I've attempted to bring up
a private copy of en.wikipedia.org based on MediaWiki and the
pages-articles.xml export. This should work, as MWDumper seems to
import the contents okay, but the resulting wiki is missing a huge
number of pages.
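(One thing I haven't ruled out: my understanding is that MWDumper
only populates the page, revision, and text tables, and that the
link tables and search index have to be rebuilt afterward by running
maintenance/rebuildall.php from the MediaWiki root. If skipping that
step would explain pages appearing to be missing, I'd welcome a
pointer.)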
So, at the end of the day, my question is: how can I either more
reliably fetch pages from wikipedia.org _or_ more reliably create a
duplicate wiki that I can scrape?
(Those who are interested can contact me privately for URLs to see
this at work. I'd rather not make them public in developmental form.)
Thank you,
Casey Bisson
__________________________________________
e-Learning Application Developer
Plymouth State University
Plymouth, New Hampshire
http://oz.plymouth.edu/~cbisson/
ph: 603-535-2256