On Sun, Oct 11, 2009 at 6:03 PM, Aryeh Gregor
<Simetrical+wikilist(a)gmail.com> wrote:
On Sun, Oct 11, 2009 at 3:28 PM, Erik Zachte
<erikzachte(a)infodisiac.com> wrote:
Any idea why there are so many TCP_DENIED/403 responses? Are these really failures?
Certain types of requests are blocked at the Squid level for various
reasons. For instance, try wgetting Wikipedia; you'll get a 403
because the default User-Agent headers of such tools are blocked.
(You're supposed to use a custom User-Agent header, preferably with
contact info, so that your script is distinctive and can be blocked
on its own if there's a problem.) Similarly, try something like this:
http://en.wikipedia.org/&
I assume this kind of thing is what causes those responses.
Actually, wget isn't blocked for either page views or action=edit,
judging by a test I ran a minute ago.
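
(For reference, here is a minimal sketch of passing a distinctive
User-Agent with wget; the agent string and contact address below are
invented for illustration:)

  wget --user-agent="MyStatsBot/1.0 (contact: stats@example.org)" \
    "http://en.wikipedia.org/wiki/Main_Page"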
On Sun, Oct 11, 2009 at 8:12 PM, Robert Rohde
<rarohde(a)gmail.com> wrote:
However, a logical guess would be that the Squids are configured to
reject action=edit requests from search-engine spiders and similar
non-human processes. Since such things are not easily expressed in
robots.txt, blocking at the Squid layer would be a good way to stop
that traffic from hitting the main servers. That's my guess; I
suspect others can give a more concrete answer.
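
(Purely as an illustration of the kind of rule Robert is guessing at,
a hypothetical Squid ACL could look like this; it is not Wikimedia's
actual configuration:)

  # hypothetical squid.conf fragment, for illustration only
  # match common crawler User-Agents
  acl spiders browser Googlebot|Slurp|msnbot
  # match edit requests by URL path
  acl editreq urlpath_regex action=edit
  # deny the combination before it reaches the backend
  http_access deny spiders editreq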
Those things are all blocked in robots.txt:
User-agent: *
Disallow: /w/
That's part of why we use long URLs for everything but page views, so
that they can be neatly blocked from spiders.
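
(For example, a plain page view uses the short /wiki/ form, while an
edit link falls under the disallowed /w/ path:)

  http://en.wikipedia.org/wiki/Main_Page
  http://en.wikipedia.org/w/index.php?title=Main_Page&action=edit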
Excellent point, though I wouldn't be surprised to find that some
disrespectful spiders and bots are also blocked at the Squid level.
-Robert Rohde