Short question:
I'm getting inconsistent 403 Forbidden errors when trying to read wikipedia.org content via PHP's file_get_contents().
I believe I'm operating within the limits and terms identified here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
Am I being blocked (inconsistently)? How would I find out?
Background:
Notwithstanding traditional librarian concerns about the authority of Wikipedia, I'm working on ways to use Wikipedia data in the library context.
Longer question:
My app uses the all_titles_in_ns0 export. If a search matches one of those titles, the app fetches that page from en.wikipedia.org to generate a summary and caches the result (future searches are served from the cache).
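In case it helps to see the shape of it, here is a rough sketch of that fetch-and-cache flow. All of the function names and the cache path below are hypothetical illustrations, not the actual app code:

```php
<?php
// Sketch of the title-match -> fetch -> cache flow described above.
// Names and paths are hypothetical, not the real application.

// Map a page title to a cache file path (pure helper).
function cache_path_for($title) {
    return '/tmp/wp-cache/' . md5($title) . '.html';
}

// Return the cached page if present; otherwise fetch it and cache it.
function fetch_page($title) {
    $cache = cache_path_for($title);
    if (file_exists($cache)) {
        return file_get_contents($cache);   // future searches hit the cache
    }
    $url  = 'http://en.wikipedia.org/wiki/' . rawurlencode($title);
    $html = file_get_contents($url);        // this is the call that 403s
    if ($html !== false) {
        @mkdir(dirname($cache), 0777, true);
        file_put_contents($cache, $html);
    }
    return $html;
}
```

The 403s show up on the `file_get_contents($url)` line, intermittently rather than on every request.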
In trying to sidestep the 403 Forbidden errors (and eliminate any complaints about scraping wikipedia.org), I've attempted to bring up a private copy of en.wikipedia.org based on MediaWiki and the pages-articles.xml export. This should work, as MWDumper seems to import the contents okay, but the result leaves me with a huge number of missing pages.
So, at the end of the day, my question is: how can I either more reliably fetch pages from wikipedia.org _or_ more reliably create a duplicate wiki that I can scrape?
(Those who are interested can contact me privately for URLs to see this at work; I'd rather not make them public while it's still in development.)
Thank you,
Casey Bisson
__________________________________________
e-Learning Application Developer
Plymouth State University
Plymouth, New Hampshire
http://oz.plymouth.edu/~cbisson/
ph: 603-535-2256
> Short question:
> I'm getting inconsistent 403 Forbidden errors when trying to read
> wikipedia.org content via PHP's file_get_contents().
Getting the same result:
==================================
root@bling:~# php --run 'print(file_get_contents("http://en.wikipedia.org/wiki/Outer_Space"));'

Warning: file_get_contents(http://en.wikipedia.org/wiki/Outer_Space): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in Command line code on line 1

Call Stack:
    0.0002    69840  1. {main}() Command line code:0
    0.0002    69912  2. file_get_contents() Command line code:1
root@bling:~#
==================================
Hmmmm.
My suspicion is that it's just wanting a User-Agent header. In which case: a) check that you're operating within the guidelines, and b) if you believe you are, just give it one that accurately describes what you're doing. For example, this works for me:
==================================
root@bling:~# php --run 'print(file_get_contents("http://en.wikipedia.org/wiki/Outer_Space", FALSE, stream_context_create(array("http" => array("header" => "User-Agent: Casey Bisson search API")))));'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr"> <head> [...snip...] </div> <!-- Served by srv28 in 0.325 secs. --></body></html>
root@bling:~#
==================================
All the best, Nick.
Nick Jenkins wrote:
> root@bling:~# php --run 'print(file_get_contents("http://en.wikipedia.org/wiki/Outer_Space", FALSE, stream_context_create(array("http" => array("header" => "User-Agent: Casey Bisson search API")))));'
I believe you can also use
ini_set('user_agent', 'Casey Bisson search API');
file_get_contents('http://en.wikipedia.org/wiki/Outer_Space');
Easier on the eyes perhaps?
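For reference, the stream-context version of the one-liner can also be written out readably. This is just a sketch of the same idea: the `ini_set` approach sets the User-Agent process-wide, while a stream context (the standard `'http'` context options) scopes it to a single request. The function names here are made up for illustration:

```php
<?php
// Build an HTTP stream context carrying a User-Agent header.
function ua_context($agent) {
    return stream_context_create(array(
        'http' => array(
            'header' => "User-Agent: $agent\r\n",
        ),
    ));
}

// Fetch a URL with that context; the third argument applies the
// context to this request only, leaving php.ini settings untouched.
function fetch_with_ua($url, $agent) {
    return file_get_contents($url, false, ua_context($agent));
}

// Usage (requires network access):
// $html = fetch_with_ua('http://en.wikipedia.org/wiki/Outer_Space',
//                       'Casey Bisson search API');
```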
-- Tim Starling
wikitech-l@lists.wikimedia.org