On Tue, Oct 4, 2011 at 3:48 PM, Roan Kattouw roan.kattouw@gmail.com wrote:
> There seem to have been a lot of page views concentrated around September 22-26. This could be something as innocent as someone running a broken bot that's supposed to fetch lots of different articles but instead fetches the same URL again and again due to a typo in the code, or it could be as malicious as someone trying to DoS us in a very simplistic way. I'll look at the sampled logs for those days and see what I can find.
I've grepped the sampled (1:1000) Squid logs for September 23rd, 24th and 25th, and I do indeed see that a vast, vast majority of requests for that article come from a single IP. In fact, I got output like this (IP addresses redacted for privacy reasons):
$ zgrep Mathematical_descriptions_of_opacity sampled-1000.log-20110924.gz | cut -d ' ' -f 5 | sort | uniq -c | sort -rn | head
   1548 AA.BB.CC.DD
      1 EE.FF.GG.HH
      1 JJ.KK.LL.MM
which means that in the sampled log (we don't keep full access logs, only a 1:1000 sample) for September 24th, of the 1550 logged requests for that article, 1548 came from our guy and 2 came from different, random people. This doesn't mean there were only 1550 visits to that page that day; due to the sampling, the real number is roughly 1550 * 1000 = 1.55 million, which matches the 1.6M reported by stats.grok.se well enough.
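To make the sampling arithmetic explicit, here is a minimal shell sketch. The function names estimate_total and top_percent are made up for illustration; the only inputs are the counts from the output above and the 1:1000 sampling factor.

```shell
# Hypothetical helpers, not part of any existing tooling.

# Scale a count from the sampled log up to an estimate of the real count.
# $1 = requests seen in the sampled log, $2 = sampling factor (1000 for 1:1000)
estimate_total() {
    echo $(( $1 * $2 ))
}

# Whole-number percentage of sampled requests that came from one client
# (integer division, so the result is rounded down).
# $1 = requests from that client, $2 = all sampled requests for the page
top_percent() {
    echo $(( 100 * $1 / $2 ))
}

estimate_total 1550 1000   # estimated total requests for the article that day
estimate_total 1548 1000   # estimated requests from the single IP
top_percent 1548 1550      # share of the sampled requests from that IP
```

With the numbers above this gives about 1.55 million requests in total, about 1.548 million of them from the one IP, which is 99% of the sampled requests once rounded down.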
Also, these requests all list http://en.wikipedia.org/wiki/Snell%27s_law as their referer:
$ zgrep Mathematical_descriptions_of_opacity sampled-1000.log-20110924.gz | grep Snell | wc -l
1548
but the Snell's law article doesn't show any strange access patterns on stats.grok.se.
So I guess this was just one IP hitting the same article ~1.5 million times per day for 3-4 days, for whatever reason. That doesn't really hurt our servers much (unless the article is also edited heavily in the meantime and contains complex templates that take a long time to parse, see Jackson, Michael) but obviously it does skew the traffic stats.
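For a sense of scale, here is a rough back-of-the-envelope sketch of the average request rate. It assumes the requests were spread evenly over the day, which the sampled logs above don't confirm, and the function name per_second is made up for illustration.

```shell
# Hypothetical helper: average requests per second over one day,
# assuming an even spread across all 86400 seconds (an assumption, not
# something the sampled logs confirm). Integer division rounds down.
per_second() {
    echo $(( $1 / 86400 ))
}

per_second 1548000   # estimated daily requests from the single IP
```

That works out to a bit under 18 requests per second on average (the integer division prints 17).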
Roan