"Axel Boldt" axelboldt@yahoo.com schrieb:
Do we forbid certain spiders access to the site based on User-Agent? A user in a German forum recently reported that he couldn't access Wikipedia at all, always receiving a "Forbidden" message. It turned out that his WebWasher proxy (an ad-banner blocker) was to blame. The proxy sends the User-Agent
"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt) WebWasher 3.0"
WebWasher cannot be used to spider and download sites.
We forbid spiders based on User-Agent, but WebWasher does not seem to be on the list. According to http://www.wikipedia.org/robots.txt, the following User-Agents are disallowed:
UbiCrawler, DOC, Zao, sitecheck.internetseer.com, Zealbot, MSIECrawler, SiteSnagger, WebStripper, WebCopier, Fetch, Ofline Explorer, Teleport, TeleportPro, WebZIP, linko, HTTrack, Microsoft.URL.Control, Xenu, larbin, libwww, ZyBORG, Download Ninja, wget, grub-client, k2spider, NPBot, HTTrack
Furthermore, I know that any request without a User-Agent is refused. There might be others, but someone who knows more about this than I do should check.
Andre Engels
On Feb 12, 2004, at 04:36, Andre Engels wrote:
"Axel Boldt" axelboldt@yahoo.com schrieb:
Do we forbid certain spiders access to the site based on User-Agent? A user in a German forum recently reported that he couldn't access Wikipedia at all, always receiving a "Forbidden" message. It turned out that his WebWasher proxy (an ad-banner blocker) was to blame. The proxy sends the User-Agent
"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt) WebWasher 3.0"
I can access pages on de.wikipedia.org using the above user-agent string, but don't have WebWasher to test with.
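For anyone who wants to repeat that check, something along these lines works (a rough sketch only; the URL and the use of Python's urllib are just for illustration, not how I actually tested it):

import urllib.error
import urllib.request

# The WebWasher User-Agent string quoted above.
UA = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt) WebWasher 3.0"

req = urllib.request.Request(
    "https://de.wikipedia.org/wiki/Wikipedia",   # any article URL will do
    headers={"User-Agent": UA},
)
try:
    with urllib.request.urlopen(req) as resp:
        print("allowed:", resp.getcode())
except urllib.error.HTTPError as err:
    # A 403 here would mean the user-agent string itself is being refused.
    print("refused:", err.code)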
WebWasher cannot be used to spider and download sites.
We forbid spiders based on User-Agent, but WebWasher does not seem to be on the list. According to http://www.wikipedia.org/robots.txt, the following User-Agents are disallowed:
robots.txt doesn't block anything based on User-Agent; it works on the honor system. If a client doesn't obey robots.txt, robots.txt causes no blocking at all.
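The honor system in practice: a polite client fetches robots.txt itself and skips whatever it disallows; nothing on our side enforces that. A rough sketch of such a check with Python's urllib.robotparser (illustration only, not something we run; the bot names are made up):

from urllib.robotparser import RobotFileParser

# A well-behaved crawler asks robots.txt before fetching anything.
rp = RobotFileParser()
rp.set_url("http://www.wikipedia.org/robots.txt")
rp.read()

# User-agents listed with "Disallow: /" get False here -- if they bother to ask.
print(rp.can_fetch("HTTrack", "http://www.wikipedia.org/wiki/Main_Page"))
print(rp.can_fetch("SomePoliteBot", "http://www.wikipedia.org/wiki/Main_Page"))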
Here's the current user-agent block list:
# we don't like these user-agents:
acl stayaway browser ^Iala
acl stayaway browser ^Teleport
acl stayaway browser ^WebStripper
acl stayaway browser ^Snoopy
acl stayaway browser grub
acl stayaway browser ZyBorg
acl stayaway browser linko
acl stayaway browser FAST
acl stayaway browser HTTrack
acl stayaway browser Microsoft.URL.Control
acl stayaway browser ^Xenu
acl stayaway browser LARBIN
acl stayaway browser efp@gmx.net
acl stayaway browser larbin
acl stayaway browser ^LWP
acl stayaway browser libwww-perl
acl stayaway browser Python-urllib
acl stayaway browser ^WorQmada
acl stayaway browser ^TorQmada
acl stayaway browser ^k2spider
acl stayaway browser fetch.api.request
acl stayaway browser Zealbot
acl stayaway browser dloader
acl stayaway browser NaverRobot
acl stayaway browser Exalead
acl stayaway browser Fetch
acl stayaway browser Offline.Explorer
acl stayaway browser WWW-Mechanize
acl stayaway browser Downlad.Ninja
acl stayaway browser Web.Downloader
acl stayaway browser HTTrack
acl stayaway browser Sister.Site
acl stayaway browser WebReaper
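Those are regular expressions that Squid's "browser" acl type matches against the User-Agent header, presumably paired with an http_access deny rule for the "stayaway" acl elsewhere in the config. A rough Python sketch of the same matching (abridged pattern list, illustration only), which also shows why the WebWasher string above slips through -- none of the patterns match it:

import re

# Abridged subset of the patterns from the acl list above, copied verbatim.
BLOCKED_PATTERNS = [
    r"^Teleport", r"^WebStripper", r"grub", r"ZyBorg", r"linko", r"FAST",
    r"HTTrack", r"Microsoft.URL.Control", r"^Xenu", r"larbin", r"^LWP",
    r"libwww-perl", r"Python-urllib", r"Zealbot", r"Offline.Explorer",
    r"WWW-Mechanize", r"WebReaper",
]

def is_blocked(user_agent):
    # True if any pattern matches anywhere in the User-Agent string.
    return any(re.search(p, user_agent) for p in BLOCKED_PATTERNS)

ua = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt) WebWasher 3.0"
print(is_blocked(ua))                    # False: nothing matches WebWasher
print(is_blocked("Python-urllib/2.0"))   # True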
Furthermore, I know that any request without a User-Agent is refused. There might be others, but someone who knows more about this than I do should check.
That hasn't been true for a week or so.
-- brion vibber (brion @ pobox.com)
Furthermore, I know that any request without a User-Agent is refused.
That hasn't been true for a week or so.
OK, the solution to this riddle seems to be that the user had (mis)configured his WebWasher to send no User-Agent at all. I read his message several days ago, so at that time we were probably still refusing access to browsers without a User-Agent.
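A quick way to reproduce the no-User-Agent case, should we ever want to test it again (sketch only; http.client, unlike urllib, does not add a User-Agent header of its own):

import http.client

# No headers are passed, so the request goes out without any User-Agent at all.
conn = http.client.HTTPSConnection("de.wikipedia.org")
conn.request("GET", "/wiki/Wikipedia")
resp = conn.getresponse()
# A 403 here would mean requests without a User-Agent are (still) refused.
print(resp.status, resp.reason)
conn.close()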
Axel