Lars,
I think you are overdoing it.
The reports are not nonsense, but have over time become more inaccurate than some other
stats we present.
Actually, if the reports had said 'pages served' rather than 'page
views', they would still be spot on.
Of course I had also hoped this filter would be implemented by now.
But sometimes projects take longer than planned, at WMF like everywhere else.
The stats still show a breakdown per language, and relative growth, assuming bot activity
is more or less consistent from one month to another (of course not over longer periods).
The last figure I was given (in April?) was that 40% of overall traffic is bot-related. That
could be more now.
Erik
-----Original Message-----
From: Lars Aronsson [mailto:lars@aronsson.se]
Sent: Thursday, February 14, 2013 1:28 AM
To: Erik Zachte
Cc: 'A mailing list for the Analytics Team at WMF and everybody who has an interest in
Wikipedia and analytics.'; Wikimedia developers
Subject: Re: [Analytics] Fwd: [Wikitech-l] Page view stats we can believe in
Hi Erik,
You're quite right numbers are inflated, and
we've been over this before [1].
Below are some sampled data for da.wiktionary from webstatscollector
[2] and squid log [3]. Bot traffic is a substantial share of 'page views' (but not
the majority as you suggest).
We discussed this extensively in April and as I remember (my mail
archive is somehow incomplete) decided to implement a second
cleaned-up stream without /bot/crawler/spider/http (keeping the
original stream so as not to break trend lines).
However, that bot-free stream (projectcounts files with an extra set of
per-wiki totals) has not happened yet, and I'm pretty sure we have changed
plans since, and are probably now waiting for Kraken. Diederik, can you add to this?
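For illustration only, here is a minimal sketch of such a filter, assuming each
log line has already been reduced to its user-agent string. The names
BOT_PATTERN, is_bot and count_views are hypothetical and not part of
webstatscollector; the pattern simply uses the four substrings mentioned above.

```python
import re

# Assumption: a request counts as a bot if its user agent contains any of
# these substrings (case-insensitive), as proposed in the discussion.
BOT_PATTERN = re.compile(r"bot|crawler|spider|http", re.IGNORECASE)


def is_bot(user_agent: str) -> bool:
    """Return True if the user agent matches one of the bot substrings."""
    return bool(BOT_PATTERN.search(user_agent))


def count_views(user_agents):
    """Return (total, filtered): all requests, and requests not flagged as bots."""
    total = 0
    filtered = 0
    for user_agent in user_agents:
        total += 1  # original stream: every request counts
        if not is_bot(user_agent):
            filtered += 1  # cleaned-up stream: bot-like agents excluded
    return total, filtered
```

Returning both totals mirrors the idea of keeping the original stream next to
the cleaned-up one, so existing trend lines are not broken.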
Oh my, I thought this was in operation already.
I've actually been looking at these page view stats, and now I feel like a fool.
Why not just remove these web pages at
http://stats.wikimedia.org/wiktionary/EN/TablesPageViewsMonthly.htm
since they contain only nonsense? Continuity with old nonsense is still nonsense, so
remove everything now and start a new project with real numbers.
[1] On April 8, 2012 you reported a similar issue for
Swedish Wikipedia.
At the time I checked one hour of sampled squid log. 9 out of 13 requests were bots.
Nobody doubts that the Swedish Wikipedia has a substantial amount of human traffic. But
for smaller projects, the background noise will dominate. If bots are 9 out of 13 requests
to sv.wikipedia (really?), they can easily be 99% of traffic to da.wiktionary.
One easy way to tell would be to observe the daily rhythm. Since Swedish and Danish are
limited to one timezone, traffic in the middle of the night should be much smaller than
mid-day traffic. But bots could be operating all night, all day. So the least active hour
is probably the background noise from bots.
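The idea above can be sketched roughly as follows, assuming 24 hourly request
counts for one day are available and that bot traffic is flat across the day.
The function name estimate_bot_share is hypothetical. Since a few people do read
at night, the result is an upper bound on the bot share rather than a
measurement.

```python
def estimate_bot_share(hourly_counts):
    """Estimate the bot share of one day's traffic from 24 hourly counts.

    Assumption: the quietest hour consists only of bot traffic, and bots
    send the same number of requests in every hour of the day.
    """
    if len(hourly_counts) != 24:
        raise ValueError("expected exactly 24 hourly counts")
    total = sum(hourly_counts)
    if total == 0:
        return 0.0
    baseline = min(hourly_counts)  # least active hour = assumed bot noise
    return baseline * 24 / total   # bot requests per day / all requests per day
```

A share close to 1.0 would mean there is no visible daily rhythm at all, which
is what one would expect if a small project's traffic is mostly bots.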
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se