On Sun, Oct 11, 2009 at 6:03 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Sun, Oct 11, 2009 at 3:28 PM, Erik Zachte <erikzachte@infodisiac.com> wrote:
Any idea why there are so many TCP_DENIED/403, and are these really failures?
Certain types of requests are blocked at the Squid level for various reasons. For instance, try wgetting Wikipedia; you'll get a 403 because the default User-Agent headers for tools like that are blocked. (You're supposed to use a custom User-Agent header, preferably with contact info, so that your script is distinctive and can be blocked on its own if there's a problem.) Similarly, try something like this:
I assume this kind of thing is what causes those responses.
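To make that concrete, here is a rough Python sketch of the kind of request header being described; the bot name and contact address are made-up placeholders, and which default UAs actually draw a 403 depends on how the Squids are configured at the time:

    # Rough sketch of the custom User-Agent advice above. The bot name and
    # contact address are placeholders, not anything Wikimedia-specific.
    import urllib.request

    url = "https://en.wikipedia.org/wiki/Main_Page"
    req = urllib.request.Request(url, headers={
        # Descriptive UA with contact info, so operators can identify this
        # script and block it on its own if it misbehaves.
        "User-Agent": "MyStatsBot/0.1 (someone@example.org)",
    })
    with urllib.request.urlopen(req) as resp:
        print(resp.status, len(resp.read()))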
Actually, wget isn't blocked for either page views or action=edit, based on a test a minute ago.
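For anyone who wants to repeat that test, something along these lines does it; the Wget version string is just a plausible example, and the outcome depends on the Squid configuration of the day:

    # Probe a page-view URL and an action=edit URL with a wget-style
    # User-Agent and print the HTTP status for each. A 403 here would show
    # up as TCP_DENIED/403 in the Squid logs.
    import urllib.error
    import urllib.request

    URLS = [
        "https://en.wikipedia.org/wiki/Main_Page",
        "https://en.wikipedia.org/w/index.php?title=Main_Page&action=edit",
    ]

    for url in URLS:
        req = urllib.request.Request(url, headers={"User-Agent": "Wget/1.11.4"})
        try:
            with urllib.request.urlopen(req) as resp:
                print(resp.status, url)
        except urllib.error.HTTPError as err:
            print(err.code, url)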
On Sun, Oct 11, 2009 at 8:12 PM, Robert Rohde <rarohde@gmail.com> wrote:
However, a logical guess would be that the Squids are configured to reject action=edit requests from search-engine spiders and similar non-human processes. Since such things are not easily expressed in robots.txt, blocking at the Squid layer would be a good way to stop that traffic from hitting the main servers. That would be my guess; I suspect others can give a more concrete answer.
Those things are all blocked in robots.txt:
User-agent: *
Disallow: /w/
That's part of why we use long URLs for everything but page views, so that they can be neatly blocked from spiders.
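As a quick check of what those two lines cover, the standard-library robot parser agrees that the long /w/ URLs (action=edit included) are off-limits to well-behaved crawlers while the short /wiki/ page-view URLs are not. This sketch feeds it only the two lines quoted above, not the full live robots.txt:

    # Feed just the two quoted robots.txt lines to Python's robot parser and
    # check a page-view URL and an action=edit URL against them.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /w/",
    ])

    print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Main_Page"))
    # -> True: plain page views are allowed for crawlers
    print(rp.can_fetch("*", "https://en.wikipedia.org/w/index.php?title=X&action=edit"))
    # -> False: everything under /w/, including action=edit, is disallowed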
Excellent point, though I wouldn't be surprised to find that some spiders and bots that ignore robots.txt are also blocked at the Squid level.
-Robert Rohde