I'm trying to access a Wikipedia page with the PEAR::HTTP_Request class in PHP. When I try to access http://en.wikipedia.org/, the response is as follows:
ERROR
The requested URL could not be retrieved
While trying to retrieve the URL: http://en.wikipedia.org/
The following error was encountered:
* Access Denied. Access control configuration prevents your request from being allowed at this time. Please contact your service provider if you feel this is incorrect.
Your cache administrator is wikidown@bomis.com.
Generated Wed, 07 Dec 2005 20:34:13 GMT by srv6.wikimedia.org (squid/2.5.STABLE12)
It's not that my user agent is blocked, because it's sending: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1). I would send my own user agent string and not fake another, but from what I hear that's not necessary for most minor projects such as mine.
Could my IP be blocked, and if so, could I find out why? If not, are there any little tricks I need to know about a PHP script connecting to Wikipedia?
Thank you,
-HoodedMan "Wind to thy wings. Light to thy path. Dreams to thy heart."
Are you sure you're setting the header correctly? The following works for me.
$r = new HTTP_Request($url);
$r->addHeader('User-Agent', 'User Agent String');
$r->sendRequest();
Mark
-----Original Message-----
From: wikitech-l-bounces@wikimedia.org [mailto:wikitech-l-bounces@wikimedia.org] On Behalf Of The Hooded Man
Sent: December 7, 2005 3:42 PM
To: wikitech-l@wikipedia.org
Subject: [Bulk] [Wikitech-l] Access Denied from Wikipedia's proxies
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
That's practically the same code as mine. No luck.
-HoodedMan "Wind to thy wings. Light to thy path. Dreams to thy heart."
-----Original Message-----
From: wikitech-l-bounces@wikimedia.org [mailto:wikitech-l-bounces@wikimedia.org] On Behalf Of Mark Jeays
Sent: Wednesday, December 07, 2005 4:51 PM
To: Wikimedia developers
Subject: RE: [Bulk] [Wikitech-l] Access Denied from Wikipedia's proxies
The Hooded Man wrote:
> It's not that my user agent is blocked, because it's sending: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1). I would send my own user agent string and not fake another, but from what I hear that's not necessary for most minor projects such as mine.
If we find you doing this, you will be blocked from access to Wikipedia. Falsifying user-agents is a fraudulent tool used by leeches trying to evade blocks.
*Always* use a real user-agent string which identifies you, including enough contact information (e-mail address or URL) that you can be reached in case you're running a legitimate bot of some sort that's causing problems by accident.
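To illustrate the advice above, here is a minimal sketch of a descriptive User-Agent string sent through PEAR's HTTP_Request. The bot name, version, contact URL, and e-mail address are placeholders, and the helper names buildUserAgent() and fetchWithUserAgent() are hypothetical, not part of PEAR. The "Name/version (URL; e-mail)" layout is only one reasonable convention, not a format this thread prescribes.

```php
<?php
// Build a User-Agent string that identifies the bot and how to reach its owner.
// All four arguments are placeholders: substitute your own details.
function buildUserAgent($botName, $version, $contactUrl, $contactEmail)
{
    return sprintf('%s/%s (%s; %s)', $botName, $version, $contactUrl, $contactEmail);
}

// Fetch a URL with the given User-Agent using PEAR's HTTP_Request.
// Assumes the PEAR HTTP_Request package is installed and on the include path.
// Returns the response body, or null if the request failed.
function fetchWithUserAgent($url, $userAgent)
{
    require_once 'HTTP/Request.php';

    $r = new HTTP_Request($url);
    $r->addHeader('User-Agent', $userAgent);

    $result = $r->sendRequest();
    if (PEAR::isError($result)) {
        return null;
    }
    return $r->getResponseBody();
}

$userAgent = buildUserAgent('ExampleBot', '0.1', 'http://example.org/bot', 'bot-owner@example.org');
echo $userAgent . "\n"; // ExampleBot/0.1 (http://example.org/bot; bot-owner@example.org)
```

With HTTP_Request installed, calling fetchWithUserAgent('http://en.wikipedia.org/', $userAgent) would send the request with that header, so anyone reading the server logs can tell who is running the script and how to contact them.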
> Could my IP be blocked, and if so, could I find out why? If not, are there any little tricks I need to know about a PHP script connecting to Wikipedia?
Might be, but we can't tell if you won't say what it is.
If you are grabbing pages live from Wikipedia and displaying them on your site with ads attached, or some other such, you will indeed be permanently blocked when you're discovered.
-- brion vibber (brion @ pobox.com)
-----Original Message-----
From: wikitech-l-bounces@wikimedia.org [mailto:wikitech-l-bounces@wikimedia.org] On Behalf Of Brion Vibber
Sent: Wednesday, December 07, 2005 5:34 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] Access Denied from Wikipedia's proxies
> If we find you doing this, you will be blocked from access to Wikipedia. Falsifying user-agents is a fraudulent tool used by leeches trying to evade blocks.
> *Always* use a real user-agent string which identifies you, including enough contact information (e-mail address or URL) that you can be reached in case you're running a legitimate bot of some sort that's causing problems by accident.
Ah, apologies. I actually misread one of Jimbo's quotes. It said: "You could give it a User-Agent string that's exactly the same as any popular browser, and we'd never know the difference." I missed the next sentence, "So it's good of you to ask." Oops.
The same list thread (archived) seemed to imply that bots were allowed on a bot-by-bot basis (with each user agent being allowed one at a time). This is obviously not the case. I will certainly have my bot use a proper User-Agent string; it did so before I changed it to see if that was the problem.
> Might be, but we can't tell if you won't say what it is.
I was leery about doing so on a public mailing list, but it's a host, I suppose, so no harm done. I figured it probably wasn't a block and was something simple I was overlooking, so the IP address wouldn't really be needed.
I believe it is 64.202.163.79.
> If you are grabbing pages live from Wikipedia and displaying them on your site with ads attached, or some other such, you will indeed be permanently blocked when you're discovered.
Not at all what I'm doing. I'll release the source code if you'd prefer when I'm done; I'd just like it to pull up a basic page first. :P
> -- brion vibber (brion @ pobox.com)
Thank you for your help,
-HoodedMan "Wind to thy wings. Light to thy path. Dreams to thy heart."