This is something we should discuss. Right now, by refusing requests that don't send a user agent, we may be locking out some people who reach us through a caching proxy. That's what I understand him to be saying.
Ideally, when we get our capacity up, we'll let the robots run free.
----- Forwarded message from Phil Howard phil@ipal.net -----
From: Phil Howard phil@ipal.net
Date: Wed, 14 Jan 2004 12:41:16 -0600
To: Jimmy Wales jwales@bomis.com
Subject: Re: Who do I contact when webmaster@wikipedia.org does not work?
On Wed, Jan 14, 2004 at 06:56:02AM -0800, Jimmy Wales wrote:
| I am told that this is because you don't send a user-agent string.
| We block those requests because they are usually from badly-written
| spiders that clobber our site.
Ah. That would be the case.
RFCs require that if a web cacher sends a user-agent string, each received document must be cached separately on the basis of that entire string so that a new request for the very same document from a different browser does not get the one previously requested. Thus if 20 different users go to a given web site from 20 different browsers, the cache has to access the origin server all 20 times, and no cache benefit is obtained, even for the image files. Given that browsers not only put major version numbers, but also minor version numbers, and in many cases even build numbers/dates, in the user agent string, that makes it appear to be a lot of different browsers for the purpose of caching.
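The effect described above can be sketched in a few lines of Python. This is only an illustration, assuming a cache that keys each stored document on the URL plus the entire user-agent string; the function name, the URL, and the browser strings are hypothetical and not taken from any real cache.

```python
def origin_fetches(requests, key_on_user_agent):
    """Count how many requests miss the cache and must go to the origin server."""
    cache = set()
    misses = 0
    for url, user_agent in requests:
        # Assumption: the cache key is either (URL, full user-agent) or the URL alone.
        key = (url, user_agent) if key_on_user_agent else (url, None)
        if key not in cache:
            cache.add(key)
            misses += 1
    return misses


# 20 users fetch the same image, each with a distinct (hypothetical) browser build.
requests = [
    ("http://example.org/logo.png", f"ExampleBrowser/1.0 build {n}")
    for n in range(20)
]

print(origin_fetches(requests, key_on_user_agent=True))   # 20
print(origin_fetches(requests, key_on_user_agent=False))  # 1
```

With the user-agent in the key, every one of the 20 requests goes to the origin server; with it left out, only the first one does.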
Here's what my browser does send:

    User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030728 Mozilla Firebird/0.6.1

I think that's excessive, but that is what is going on.
I think you need to reconsider the issue with user-agent. There are a few commercial web caches that strip the user-agent string to enhance performance for businesses that are on slower network links. And I don't see any way to make my cache make an exception for certain sites.
Maybe I can hack the code in Squid (which is what we use here) to make it add a generic user-agent string in place of the deleted one. If you decide you cannot change how you handle a missing user-agent, then clearly the step to take is to fake one that can't be interpreted as any specific browser:

    User-Agent: generic cache
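As a rough sketch of that idea, here is the header rewrite in Python. It is not Squid code and says nothing about how Squid is structured internally; the function name and the dict-of-headers representation are assumptions made for illustration.

```python
GENERIC_USER_AGENT = "generic cache"  # the placeholder string proposed above


def outgoing_headers(client_headers):
    """Return the headers a cache would forward to the origin server.

    The client's own User-Agent is dropped, so all browsers share one
    cached copy, and a single generic value is sent in its place.
    """
    forwarded = {
        name: value
        for name, value in client_headers.items()
        if name.lower() != "user-agent"  # header names are case-insensitive
    }
    forwarded["User-Agent"] = GENERIC_USER_AGENT
    return forwarded


browser_request = {
    "Host": "www.wikipedia.org",
    "user-agent": "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) "
                  "Gecko/20030728 Mozilla Firebird/0.6.1",
}
print(outgoing_headers(browser_request))
```

Every request leaving the cache then carries the same user-agent value, whatever browser made it, so the origin server never sees a request without one.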
| But I notice that the screenshot is of mozilla firebird. May I
| share that link with the developers?
Sure. If by developers you mean the mozilla developers, be sure to tell them it's not a mozilla issue; it's a caching issue.
BTW, I had heard that some spiders specifically omitted user-agent so that they could be sure to cache something generic for every browser. Maybe they will be going to the fake user-agent string, too.
So why not at least put up a page that says why the site cannot be accessed? Just don't put any links in that page for any spiders to continue following so they cache/index only that one, and leave. Surely they are not accessing the same page over and over and over because it didn't have an error status.
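A minimal sketch of such a page, in Python. The function name, the status codes chosen, and the wording of the explanation are all assumptions for illustration; this is not how the Wikipedia servers are actually configured.

```python
def respond(request_headers):
    """Return (status, body) for a request, given its headers as a dict."""
    has_agent = any(
        name.lower() == "user-agent" and value.strip()
        for name, value in request_headers.items()
    )
    if has_agent:
        return 200, "normal page content"

    # No user-agent: explain why, and include no links for a spider to follow.
    explanation = (
        "Access denied: your request did not include a User-Agent header. "
        "Such requests are blocked because they usually come from badly "
        "written spiders. If you are behind a web cache that removes this "
        "header, ask its administrator to send a generic one instead."
    )
    return 403, explanation


status, body = respond({"Host": "www.wikipedia.org"})
print(status)
print(body)
```

A person behind a cache gets an explanation they can act on, while a spider finds nothing further to crawl.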
|
|
| Jimmy Wales wrote:
|
| > wow that must have been a very temporary thing, let me know if you see
| > it again.
| >
| > Phil Howard wrote:
| >
| > > On Tue, Jan 13, 2004 at 04:26:55AM -0800, Jimmy Wales wrote:
| > >
| > > | Phil Howard wrote:
| > > | > Who do I contact when webmaster@wikipedia.org does not work?
| > > |
| > > | Me. Hi, what's up?
| > >
| > > webmaster@wikipedia.org did not work, which is what I first tried when I
| > > could not access the web site.
| > >
| > > I get a 403 permission denied. Here is a screenshot:
| > > http://phil.ipal.org/wikipedia.png
| > >
| > > Here's the bounce from the original message I sent:
| > >
| > > =============================================================================
| > > [-- Attachment #1: Notification --]
| > > [-- Type: text/plain, Encoding: 7bit, Size: 0.5K --]
| > >
| > > This is the Postfix program at host vega.ipal.net.
| > >
| > > I'm sorry to have to inform you that the message returned
| > > below could not be delivered to one or more destinations.
| > >
| > > For further assistance, please send mail to <postmaster>
| > >
| > > If you do so, please include this problem report. You can
| > > delete your own text from the message returned below.
| > >
| > > The Postfix program
| > >
| > > webmaster@wikipedia.org: host mail.wikipedia.org[130.94.122.197] said: 550
| > > webmaster@wikipedia.org: User unknown in local recipient table (in reply
| > > to RCPT TO command)
| > >
| > > [-- Attachment #2: Delivery error report --]
| > > [-- Type: message/delivery-status, Encoding: 7bit, Size: 0.3K --]
| > >
| > > Reporting-MTA: dns; vega.ipal.net
| > > Arrival-Date: Mon, 12 Jan 2004 23:31:44 -0600 (CST)
| > >
| > > Final-Recipient: rfc822; webmaster@wikipedia.org
| > > Action: failed
| > > Status: 5.0.0
| > > Diagnostic-Code: X-Postfix; host mail.wikipedia.org[130.94.122.197] said: 550
| > > webmaster@wikipedia.org: User unknown in local recipient table (in reply
| > > to RCPT TO command)
| > >
| > > [-- Attachment #3: Undelivered Message --]
| > > [-- Type: message/rfc822, Encoding: 7bit, Size: 0.8K --]
| > >
| > > Date: Mon, 12 Jan 2004 23:31:44 -0600
| > > From: Phil Howard phil@ipal.net
| > > To: webmaster@wikipedia.org
| > > Subject: still not working
| > > User-Agent: Mutt/1.4.1i
| > >
| > > Here's the screenshot: http://phil.ipal.org/wikipedia.png
| > >
| > > --
| > > -----------------------------------------------------------------------------
| > > | Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| > > | (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
| > > -----------------------------------------------------------------------------
| > > =============================================================================
| > >
| > > --
| > > -----------------------------------------------------------------------------
| > > | Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| > > | (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
| > > -----------------------------------------------------------------------------
| > >