The idea is to select edit and submit calls that are relevant to the usability project and track the edit/save ratio of the filtered calls over time. Bots will be filtered out, "action=edit&redlink=1,.." calls will be discarded (as 95% of them are inadvertent edit calls), and some more. I would appreciate help in decoding the most frequently occurring squid/HTTP statuses:
Here are the relevant HTTP codes from the FAQ: http://wiki.squid-cache.org/SquidFaq/SquidLogs Also http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
000 Used mostly with UDP traffic.
200 OK
206 Partial Content
301 Moved Permanently
302 Moved Temporarily
400 Bad Request
403 Forbidden
404 Not Found
[417 Expectation Failed]
500 Internal Server Error
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
The following are the frequencies with which the index.php result codes occur in the 1:1000 sampled squid logs from just over 6 months:
TCP_DENIED/403,action=edit 321390
TCP_DENIED/403,action=submit 33
TCP_MISS/000,action=edit 7352
TCP_MISS/000,action=submit 1186
TCP_MISS/200,action=edit 800200
TCP_MISS/200,action=submit 75768
TCP_MISS/206,action=edit 20
TCP_MISS/206,action=submit 269
TCP_MISS/301,action=edit 662
TCP_MISS/302,action=edit 184217
TCP_MISS/302,action=submit 116141
TCP_MISS/400,action=edit 6
TCP_MISS/403,action=edit 2746
TCP_MISS/404,action=edit 119
TCP_MISS/404,action=submit 206
TCP_MISS/417,action=edit 53
TCP_MISS/417,action=submit 716
TCP_MISS/500,action=edit 362
TCP_MISS/500,action=submit 81
TCP_MISS/502,action=submit 87
TCP_MISS/503,action=edit 7
TCP_MISS/503,action=submit 5878
TCP_MISS/504,action=edit 53
TCP_MISS/504,action=submit 91
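A tally like the one above could be produced along the following lines. This is only a sketch: it assumes the squid result code and the action parameter have already been extracted from each sampled log line, and the names tally, estimate_total and the sample records are made up for illustration.

```python
from collections import Counter

# The logs are sampled 1:1000, so each sampled line stands for roughly 1000 requests.
SAMPLING_FACTOR = 1000


def tally(records):
    """Count records per key such as 'TCP_MISS/200,action=edit'.

    `records` is an iterable of (squid_result, action) pairs that some
    earlier (hypothetical) parsing step has pulled out of the log lines.
    """
    counts = Counter()
    for squid_result, action in records:
        counts[f"{squid_result},action={action}"] += 1
    return counts


def estimate_total(sampled_count):
    """Scale a sampled count up to an estimate of the real request count."""
    return sampled_count * SAMPLING_FACTOR


# Made-up records, only to show the shape of the input.
sample = [
    ("TCP_MISS/200", "edit"),
    ("TCP_MISS/200", "edit"),
    ("TCP_MISS/302", "submit"),
    ("TCP_DENIED/403", "edit"),
]
counts = tally(sample)
for key, n in sorted(counts.items()):
    print(key, n)
```

Keeping the key in the same "result/status,action=..." form as the listing makes it easy to compare a new run against the numbers above.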
Of these, the most significant, given range and/or frequency, are:
TCP_DENIED/403,action=edit 321390
TCP_MISS/000,action=edit 7352
TCP_MISS/200,action=edit 800200
TCP_MISS/302,action=edit 184217
TCP_MISS/000,action=submit 1186
TCP_MISS/200,action=submit 75768
TCP_MISS/302,action=submit 116141
Specific questions:
A Any idea why there are so many TCP_DENIED/403, are these really failures ?
B For action=submit the difference between preview and save is in the result codes right ? I understood earlier that TCP_MISS/302 is a successful save, right ? Does that mean TCP_MISS/200 is preview ?
C For action=edit how to interpret /200 vs /302 ?
D (minor) Are TCP/000 indeed (invalid) UDP messages ?
Erik Zachte
BTW For all squid status codes from Wikimedia servers see http://stats.wikimedia.org/wikimedia/squids/SquidReportMethods.htm
On Sun, Oct 11, 2009 at 12:28 PM, Erik Zachte erikzachte@infodisiac.com wrote: <snip>
A Any idea why there are so many TCP_DENIED/403, are these really failures ?
TCP_DENIED is usually used for requests that the Squid is configured to reject at the ACL level without even attempting to contact upstream servers.
I'm not sure where the squid configuration files for Wikimedia actually live. Hopefully someone who does know will be able to give you a precise answer to your question. However, a logical guess would be that the Squid is configured to reject action=edit requests from search engine spiders and similar non-human processes. Since such things are not easily incorporated into robots.txt, blocking at the squid layer would be a good option for stopping such traffic from hitting the main servers. That would be my guess. I suspect others can give a more concrete answer.
B For action=submit the difference between preview and save is in the result codes right ? I understood earlier that TCP_MISS/302 is a successful save, right ?
Typically.
Does that mean TCP_MISS/200 is preview ?
Preview, show changes, and aborted saves (e.g. saves stopped by edit conflicts and similar problems)
C For action=edit how to interpret /200 vs /302 ?
I don't know when action=edit would give a 302. It is obviously very common, but my attempts to guess where it would come up have failed. If you can grab some examples of URLs generating the 302 response it might become clear quickly.
D (minor) Are TCP/000 indeed (invalid) UDP messages ?
No idea.
-Robert Rohde
2009/10/12 Robert Rohde rarohde@gmail.com:
B For action=submit the difference between preview and save is in the result codes right ? I understood earlier that TCP_MISS/302 is a successful save, right ?
Upon a successful save, action=submit uses a 302 to redirect to the page view of the newly updated/created article. So typically, a successful request to /w/index.php?title=Example&action=submit will redirect to /wiki/Example using a 302.
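The status-code reading of action=submit described in this thread (302 for a successful save, 200 for previews, show changes and aborted saves) could be turned into a rough edit/save ratio roughly as follows. This is a sketch, not an agreed method: the counts are the sampled figures posted at the top of the thread, and using TCP_MISS/200 for action=edit as the number of edit forms served is an assumption made here.

```python
def classify_submit(http_status):
    """Rough interpretation of an action=submit response by HTTP status."""
    if http_status == 302:
        return "saved"      # redirect to the page view after a successful save
    if http_status == 200:
        return "not_saved"  # preview, show changes, edit conflict, etc.
    return "other"          # errors, timeouts and anything else


# Sampled (1:1000) counts taken from the listing earlier in the thread.
EDIT_200 = 800200    # TCP_MISS/200,action=edit
SUBMIT_302 = 116141  # TCP_MISS/302,action=submit
SUBMIT_200 = 75768   # TCP_MISS/200,action=submit

# Assumption: edit forms served (edit/200) per successful save (submit/302).
edit_save_ratio = EDIT_200 / SUBMIT_302

# Share of the 200/302 submit requests that ended in a save.
save_share = SUBMIT_302 / (SUBMIT_302 + SUBMIT_200)

print(f"edit/save ratio: {edit_save_ratio:.2f}")
print(f"share of submits that saved: {save_share:.1%}")
```

The same classification would have to be applied after bots and redlink requests are filtered out, so the numbers printed here are only indicative.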
For action=edit how to interpret /200 vs /302 ?
I don't know when action=edit would give a 302. It is obviously very common, but my attempts to guess where it would come up have failed. If you can grab some examples of URLs generating the 302 response it might become clear quickly.
URLs with &redlink=1 redirect to the page view with a 302.
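For the filtering step, requests of this kind could be recognised from the query string alone. A minimal sketch, assuming the request URL is available as a string; is_redlink_edit is a hypothetical helper name:

```python
from urllib.parse import parse_qs, urlsplit


def is_redlink_edit(url):
    """True if the URL is an action=edit request that carries redlink=1."""
    query = parse_qs(urlsplit(url).query)
    return query.get("action") == ["edit"] and query.get("redlink") == ["1"]


print(is_redlink_edit("/w/index.php?title=Example&action=edit&redlink=1"))
print(is_redlink_edit("/w/index.php?title=Example&action=edit"))
```

Parsing the query string rather than searching for the substring "redlink=1" avoids matching a page title that happens to contain that text.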
Roan Kattouw (Catrope)
On Sun, Oct 11, 2009 at 3:28 PM, Erik Zachte erikzachte@infodisiac.com wrote:
A Any idea why there are so many TCP_DENIED/403, are these really failures ?
Certain types of requests are blocked at the Squid level for various reasons. For instance, try wgetting Wikipedia; you'll get a 403 because the default UA headers for such things are blocked. (You're supposed to use a custom UA header, preferably with contact info, to make your script distinctive and easily blockable by itself if there's a problem.) Similarly, try something like this:
I assume this kind of thing is what causes those responses.
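As an illustration of the custom UA header mentioned above, a script could set its own User-Agent along these lines. This is only a sketch: the bot name and contact address are placeholders, and the request is built but not sent.

```python
from urllib.request import Request

# Placeholder identification string: a distinctive name plus contact info.
USER_AGENT = "ExampleStatsBot/0.1 (contact: someone@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Build a request that identifies the script instead of using a default UA."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://en.wikipedia.org/wiki/Example")
# urllib stores header names capitalised as 'User-agent'.
print(req.get_header("User-agent"))
```

Passing the object to urllib.request.urlopen would actually send it; that call is left out here so the example does not touch the network.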
On Sun, Oct 11, 2009 at 8:12 PM, Robert Rohde rarohde@gmail.com wrote:
However, a logical guess would be that the Squid is configured to reject action=edit requests from search engine spiders and similar non-human processes. Since such things are not easily incorporated into robots.txt, blocking at the squid layer would be a good option for stopping such traffic from hitting the main servers. That would be my guess. I suspect others can give a more concrete answer.
Those things are all blocked in robots.txt:
User-agent: *
Disallow: /w/
That's part of why we use long URLs for everything but page views, so that they can be neatly blocked from spiders.
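The effect of that rule can be checked with Python's standard robots.txt parser. A small sketch, using only the two-line rule quoted above (the real robots.txt contains more than this) and a hypothetical helper named allowed:

```python
from urllib.robotparser import RobotFileParser

# Only the rule quoted above, not the full Wikimedia robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""


def allowed(path, agent="*"):
    """Return True if a well-behaved crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


# Long URL (edit form) versus short URL (page view).
print(allowed("/w/index.php?title=Example&action=edit"))
print(allowed("/wiki/Example"))
```

This only models spiders that honour robots.txt; anything that ignores it would still reach the Squid layer.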
On Sun, Oct 11, 2009 at 6:03 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Sun, Oct 11, 2009 at 3:28 PM, Erik Zachte erikzachte@infodisiac.com wrote:
A Any idea why there are so many TCP_DENIED/403, are these really failures ?
Certain types of requests are blocked at the Squid level for various reasons. For instance, try wgetting Wikipedia; you'll get a 403 because the default UA headers for such things are blocked. (You're supposed to use a custom UA header, preferably with contact info, to make your script distinctive and easily blockable by itself if there's a problem.) Similarly, try something like this:
I assume this kind of thing is what causes those responses.
Actually, wget isn't blocked for either page views or action=edit, based on a test a minute ago.
On Sun, Oct 11, 2009 at 8:12 PM, Robert Rohde rarohde@gmail.com wrote:
However, a logical guess would be that the Squid is configured to reject action=edit requests from search engine spiders and similar non-human processes. Since such things are not easily incorporated into robots.txt, blocking at the squid layer would be a good option for stopping such traffic from hitting the main servers. That would be my guess. I suspect others can give a more concrete answer.
Those things are all blocked in robots.txt:
User-agent: *
Disallow: /w/
That's part of why we use long URLs for everything but page views, so that they can be neatly blocked from spiders.
Excellent point, though I wouldn't be surprised to find that some disrespectful spiders and bots are also blocked at the squid level.
-Robert Rohde
Hi!
Any idea why there are so many TCP_DENIED/403, are these really failures ?
99% of TCP_DENIED requests for action=edit have &amp; in the URL (broken clients)
B For action=submit the difference between preview and save is in the result codes right ? I understood earlier that TCP_MISS/302 is a successful save, right ? Does that mean TCP_MISS/200 is preview ?
TCP_MISS/200 can be for previews, edit conflicts, filter actions, anything that adds more steps to the save, etc.
C For action=edit how to interpret /200 vs /302 ?
&redlink=1
D (minor) Are TCP/000 indeed (invalid) UDP messages ?
Nope, must be something else.
Domas
wikitech-l@lists.wikimedia.org