On Sun, Oct 11, 2009 at 6:03 PM, Aryeh Gregor
<Simetrical+wikilist(a)gmail.com> wrote:
On Sun, Oct 11, 2009 at 3:28 PM, Erik Zachte
<erikzachte(a)infodisiac.com> wrote:
Any idea why there are so many TCP_DENIED/403 responses? Are these really failures?
Certain types of requests are blocked at the Squid level for various
reasons. For instance, try wgetting Wikipedia; you'll get a 403
because the default User-Agent headers of such tools are blocked.
(You're supposed to use a custom User-Agent header, preferably with
contact info, so that your script is distinctive and can be blocked
on its own if there's a problem.) Similarly, try something like this:
http://en.wikipedia.org/&
I assume this kind of thing is what causes those responses.
Actually, wget isn't blocked for either page views or action=edit,
judging by a test I ran a minute ago.
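
(For reference, here is a minimal sketch of passing a distinctive
User-Agent with wget; the agent string and contact address below are
invented for illustration:)

  wget --user-agent="MyStatsBot/1.0 (contact: stats@example.org)" \
    "http://en.wikipedia.org/wiki/Main_Page"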
On Sun, Oct 11, 2009 at 8:12 PM, Robert Rohde
<rarohde(a)gmail.com> wrote:
However, a logical guess would be that the Squids are configured to
reject action=edit requests from search-engine spiders and similar
non-human processes. Since such things are not easily expressed in
robots.txt, blocking at the Squid layer would be a good way to stop
that traffic from hitting the main servers. That's my guess; I
suspect others can give a more concrete answer.
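
(Purely as an illustration of the kind of rule Robert is guessing at,
a hypothetical Squid ACL could look like this; it is not Wikimedia's
actual configuration:)

  # hypothetical squid.conf fragment, for illustration only
  # match common crawler User-Agents
  acl spiders browser Googlebot|Slurp|msnbot
  # match edit requests by URL path
  acl editreq urlpath_regex action=edit
  # deny the combination before it reaches the backend
  http_access deny spiders editreq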
Those things are all blocked in robots.txt:
User-agent: *
Disallow: /w/
That's part of why we use long URLs for everything but page views, so
that they can be neatly blocked from spiders.
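
(For example, a plain page view uses the short /wiki/ form, while an
edit link falls under the disallowed /w/ path:)

  http://en.wikipedia.org/wiki/Main_Page
  http://en.wikipedia.org/w/index.php?title=Main_Page&action=edit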
Excellent point, though I wouldn't be surprised to find that some
disrespectful spiders and bots are also blocked at the Squid level.
-Robert Rohde