On Tue, Oct 4, 2011 at 3:48 PM, Roan Kattouw <roan.kattouw(a)gmail.com> wrote:
> There seem to have been a lot of page views concentrated around
> September 22-26. This could be something as innocent as someone
> running a broken bot that's supposed to fetch lots of different
> articles but instead fetches the same URL again and again due to a
> typo in the code, or it could be as malicious as someone trying to DoS
> us in a very simplistic way. I'll look at the sampled logs for those
> days and see what I can find.
I've grepped the sampled (1:1000) Squid logs for September 23rd, 24th
and 25th, and I do indeed see that the vast majority of requests for
that article come from a single IP. In fact, I got output like
this (IP addresses redacted for privacy reasons):
$ zgrep Mathematical_descriptions_of_opacity sampled-1000.log-20110924.gz \
    | cut -d ' ' -f 5 | sort | uniq -c | sort -rn | head
1548 AA.BB.CC.DD
1 EE.FF.GG.HH
1 JJ.KK.LL.MM
which means that in the sampled log (we don't keep full access logs,
only a 1:1000 sample) for September 24th, of the 1550 logged requests,
1548 came from our guy and 2 came from different, random people. This
doesn't mean there were only 1550 visits to that page that day; due to
the sampling, the real number is roughly 1550*1000 = 1.55
million, which matches the 1.6M reported by stats.grok.se well enough.
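To make the arithmetic explicit, here is a rough sketch of the scaling step. The function names estimate_total and top_share_percent are made up for illustration and are not part of any existing tooling; the only inputs are the counts from the output above and the 1:1000 sampling factor.

```shell
# Hypothetical helper: scale a count taken from a 1:N sampled log
# up to an estimate of the real number of requests.
estimate_total() {
    sampled_count=$1
    sampling_factor=${2:-1000}   # default matches the 1:1000 sampled logs
    echo $(( sampled_count * sampling_factor ))
}

# Hypothetical helper: share of sampled requests that came from one
# client, as an integer percentage (integer division truncates).
top_share_percent() {
    top_count=$1
    total_count=$2
    echo $(( top_count * 100 / total_count ))
}

estimate_total 1550            # prints 1550000
top_share_percent 1548 1550    # prints 99
```

This is only an estimate: with 1:1000 sampling the true count can be off by a fair margin for small numbers, though for counts in the thousands it should be in the right ballpark.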
Also, these requests all list
http://en.wikipedia.org/wiki/Snell%27s_law as their referer:
$ zgrep Mathematical_descriptions_of_opacity sampled-1000.log-20110924.gz \
    | grep Snell | wc -l
1548
but the Snell's law article doesn't show any strange access patterns
on stats.grok.se.
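Note that grep Snell matches anywhere in the log line, not just in the referer. A slightly stricter check would tally one field at a time. Below is a minimal sketch of a reusable version of the cut/sort/uniq pipeline above; tally_field is a made-up name, and the field number is left as a parameter because the only field position given here is 5 for the client IP.

```shell
# Hypothetical helper: read log lines on stdin and count how often each
# distinct value appears in one space-separated field, most frequent first.
tally_field() {
    field=$1          # 1-based field number, as passed to cut -f
    limit=${2:-10}    # how many rows to show; head's default is also 10
    cut -d ' ' -f "$field" | sort | uniq -c | sort -rn | head -n "$limit"
}

# Usage (field 5 is the client IP, as in the pipeline above):
#   zgrep Mathematical_descriptions_of_opacity sampled-1000.log-20110924.gz \
#       | tally_field 5
```

To tally referers the same way, pass whichever field holds the referer in your log format instead of 5.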
So I guess this was just one IP hitting the same article ~1.5 million
times per day for 3-4 days, for whatever reason. That doesn't really
hurt our servers much (unless the article is also edited heavily in
the meantime and contains complex templates that take a long time to
parse, see Jackson, Michael) but obviously it does skew the traffic
stats.
Roan