Hello,
I have been reading the "recent changes" and "new articles" RSS feeds through My Yahoo. Now I've written a small Python program to fetch and parse these feeds, but it fails most of the time with a message about an "intermittent server problem" and a warning that my user-agent may be blocked. Since the feeds are available through my browser, I conclude that I have been blocked. (My program sends the user-agent 'WikiWalker' and provides my email address.)
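(Stripped to its essentials, the fetch-and-parse part is roughly the sketch below; the feed URL and e-mail address are placeholders, not the real values:)

    import urllib
    from xml.dom import minidom

    FEED_URL = 'http://example.org/recentchanges.rss'   # placeholder for the real feed URL

    class WalkerOpener(urllib.FancyURLopener):
        # identify the script instead of using the default Python user-agent
        version = 'WikiWalker/0.1 (ken@example.com)'

    data = WalkerOpener().open(FEED_URL).read()
    doc = minidom.parseString(data)
    for item in doc.getElementsByTagName('item'):
        title = item.getElementsByTagName('title')[0]
        print title.firstChild.data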
I have also published RSS feeds, and I know that RSS readers and aggregation sites are among the most aggressive 'attackers' - frequently ignoring robots.txt, busting caches, and requesting pages at outrageous speeds. But my WikiWalker is really benign, so how can I get unblocked?
TIA,
Ken
On Oct 23, 2004, at 8:21 AM, Ken Ara wrote:
> [...] it fails most of the time with a message about an "intermittent server problem" and a warning that my user-agent may be blocked. [...]
Can you provide a dump of the sent and received HTTP headers? You might just be getting timeouts; the servers have had some ups and downs lately.
-- brion vibber (brion @ pobox.com)
Thank you, Brion and Mark.
I'm sorry, but I don't know how to obtain the headers. However, I can view the RSS files without problem in a browser, so I don't think it's a Wikipedia server issue. I changed the user-agent sent by the script to that of a popular browser (hadn't thought of that...), with no improvement. Finally, I went to FeedBurner.com and gave them the feed to fetch; my program has been able to read it off the FeedBurner site without problem (it updates every 30 minutes, which isn't ideal).
Any further insights or suggestions are welcome.
Ken
--- Brion Vibber brion@pobox.com wrote:
> On Oct 23, 2004, at 8:21 AM, Ken Ara wrote:
>> [...] it fails most of the time with a message about an "intermittent server problem" and a warning that my user-agent may be blocked. [...]
> Can you provide a dump of the sent and received HTTP headers? You might just be getting timeouts; the servers have had some ups and downs lately.
> -- brion vibber (brion @ pobox.com)
On Oct 24, 2004, at 7:28 AM, Ken Ara wrote:
> I'm sorry, but I don't know how to obtain the headers, [...]
If you're using urllib:

    import httplib
    httplib.HTTPConnection.debuglevel = 1
Or use a packet sniffer like Ethereal to grab the HTTP traffic.
Basically I'm trying to find out:
a) whether your robot is sending all the correct headers, such as the custom user-agent and the Host: header, and
b) what response is being returned that makes it look like you've been "blocked".
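Off the top of my head, something along these lines should show both sides (the feed URL below is just a placeholder):

    import httplib, urllib

    # debuglevel=1 makes httplib print the request it sends and the reply
    # headers it receives to stdout
    httplib.HTTPConnection.debuglevel = 1

    FEED_URL = 'http://example.org/recentchanges.rss'   # placeholder; use the real feed URL

    opener = urllib.FancyURLopener()
    # send the custom user-agent instead of the default "Python-urllib/x.y"
    opener.addheaders = [('User-Agent', 'WikiWalker/0.1 (your-email@example.com)')]

    f = opener.open(FEED_URL)
    print f.info()          # the received headers, parsed
    print f.read()[:300]    # the start of the body, in case it's an error page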
-- brion vibber (brion @ pobox.com)
Another possibility is that he failed to change the user-agent, so the bot is using the default Python user-agent. That agent is blocked because the Python Wikipediabot used it in the past, and at that time it did not throttle and tended to load each page once to check whether it existed and then a second time to read its content.
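(Roughly, it did the equivalent of the commented-out lines below for every page, where a single request plus a pause would have been enough; the URL is a placeholder:)

    import time, urllib

    page_url = 'http://example.org/wiki/Some_page'   # placeholder

    # old behaviour, roughly: two requests for every page
    # page_exists = urllib.urlopen(page_url).read() != ''   # hit 1, just to check
    # page_text   = urllib.urlopen(page_url).read()         # hit 2, to actually read it

    # friendlier: fetch once, reuse the body, and pause before the next page
    page_text = urllib.urlopen(page_url).read()
    page_exists = page_text != ''
    time.sleep(10)   # simple throttle between requests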
I do wonder why we are still blocking it more than a year afterward...
Andre Engels
On Sat, 23 Oct 2004 14:04:44 -0700, Brion Vibber brion@pobox.com wrote:
> On Oct 23, 2004, at 8:21 AM, Ken Ara wrote:
>> [...] (My program sends the user-agent 'WikiWalker' and provides my email address.)
> Can you provide a dump of the sent and received HTTP headers? [...]
Hi André, Hi Tom,
I did change the user-agent: I knew py-urllib had been banned, but not why. Still, I think unique user-agents should be used - nothing wrong with a little accountability. When I spot an obviously faked one, I usually deny it as being up to no good. *#&%@(*!!
My 'robot' hardly deserves the name; it only fetches and parses the Wikipedia RSS feeds. I'll have a look at PWB, sounds interesting.
Ken
--- Andre Engels andreengels@gmail.com wrote:
> Another possibility is that he failed to change the user-agent, so the bot is using the default Python user-agent, which is blocked because the Python Wikipediabot used it in the past [...]
> I do wonder why we are still blocking it more than a year afterward...
Hi, do you know about pywikipediabot.sf.net? Maybe you can include your bot there; that would solve your problem.
ciao, tom