I'm forwarding to the Analytics list, which is a better place to discuss this.
Matt Flaschen
-------- Original Message -------- Subject: [Wikitech-l] Page view stats we can believe in Date: Wed, 13 Feb 2013 22:18:44 +0100 From: Lars Aronsson lars@aronsson.se Reply-To: Wikimedia developers wikitech-l@lists.wikimedia.org To: Wikimedia developers wikitech-l@lists.wikimedia.org
I stumbled on the Danish Wiktionary, of all projects. Danish is the 68th biggest language of Wiktionary, and has a little more than 8,000 articles in total. Most of these articles are very short and provide no value to a reader. There is no reason to link to them, and so it is very unlikely that the next user should stumble upon them, unless they are me.
Yet, wikistats tries to make me believe that this tiny project has 400,000 or 500,000 page views each month, and has had for a long time: http://stats.wikimedia.org/wiktionary/EN/TablesPageViewsMonthly.htm
(I'm not talking about January 2012, which seems to have been an error, and reports 2-3 times that many views.)
My guess is that da.wiktionary has 4,000 page views per month, not 400,000. It's more likely that 400,000 is some background noise, an offset number that should be subtracted from the number of page views for any project.
If you look at the log files for just one day, you should see my IP address (85.228.something) and 3-4 other users who have been editing lately, and not many more people, but perhaps a bunch of interwiki bots.
We need an explanation for these vastly inflated page view statistics.
Lars,
You're quite right that the numbers are inflated, and we've been over this before [1]. Below are some sampled data for da.wiktionary from webstatscollector [2] and the squid log [3]. Bot traffic is a substantial share of 'page views' (but not the majority as you suggest).
We discussed this extensively in April and, as I remember (my mail archive is somehow incomplete), decided to implement a second cleaned-up stream without /bot/crawler/spider/http (keeping the original stream so as not to break trend lines).
However, that bot-free stream (projectcounts files with an extra set of per-wiki totals) has not happened yet, and I'm pretty sure we have changed plans since and are probably now waiting for Kraken. Diederik, can you add to this?
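A minimal sketch of the kind of filter described above, assuming each log record exposes a user-agent string. The pattern only mirrors the /bot/crawler/spider/http terms named in this message; the function names and the sample user agents are hypothetical, and this is not the actual webstatscollector code.

```python
import re

# Terms taken from the message above; matching is case-insensitive.
BOT_PATTERN = re.compile(r"bot|crawler|spider|http", re.IGNORECASE)


def is_probable_bot(user_agent: str) -> bool:
    """Return True if the user agent contains one of the bot-like terms."""
    return bool(BOT_PATTERN.search(user_agent))


def split_counts(user_agents):
    """Return (original stream count, cleaned-up stream count)."""
    total = 0
    cleaned = 0
    for ua in user_agents:
        total += 1  # the original stream counts every request
        if not is_probable_bot(ua):
            cleaned += 1  # the cleaned-up stream skips bot-like agents
    return total, cleaned


# Hypothetical sample, for illustration only.
sample = [
    "Mozilla/5.0 (Windows NT 6.1; rv:18.0) Gecko/20100101 Firefox/18.0",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)",
]
total, cleaned = split_counts(sample)
print(total, cleaned)
```

Keeping both counts side by side is what would let the original trend lines continue while a bot-free series starts next to them.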
Cheers,
Erik
[1] On April 8, 2012 you reported a similar issue for Swedish Wikipedia. I then checked one hour of the sampled squid log: 9 out of 13 requests were bots.
[2] I just checked the hourly page views reported by webstatscollector: yesterday's hourly average was 619, so the monthly total would be about 445K. So based on this file (which feeds the report) we really do seem to get this many messages. You can check at http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-02/ (grep for da.d (dictionary) in the projectcounts files).
[3] The 1:1000 sampled squid log for Jan 31 has 15 lines with da.wiktionary and html. That matches nicely with projectcounts (619*24=14856 per day, or about 15 in a 1:1000 sampled log). 9 seem legit browser requests, 4 are google bot, 1 feedfetcher.
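The arithmetic in [2] and [3] can be reproduced as a quick cross-check. This is only a sketch: the 619 hourly average and the 1:1000 sampling rate come from the notes above, while the 30-day month is an assumption made here to arrive at the roughly 445K figure.

```python
HOURLY_AVERAGE = 619   # from note [2]: yesterday's hourly average
SAMPLING_RATE = 1000   # from note [3]: the squid log keeps 1 in 1000 requests
DAYS_PER_MONTH = 30    # assumption: a 30-day month

per_day = HOURLY_AVERAGE * 24
per_month = per_day * DAYS_PER_MONTH
expected_sampled_lines = per_day / SAMPLING_RATE

print(per_day)                 # 14856
print(per_month)               # 445680
print(expected_sampled_lines)  # 14.856
```

About 15 sampled lines per day is what the projectcounts figure predicts, which is why the two sources are said to agree.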
-----Original Message----- From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Matthew Flaschen Sent: Wednesday, February 13, 2013 10:37 PM To: analytics@lists.wikimedia.org; lars@aronsson.se Subject: [Analytics] Fwd: [Wikitech-l] Page view stats we can believe in
Two corrections:
Bot traffic is a substantial share of 'page views' (but not the >vast< majority as you suggest). (in one of two examples below bots are actually the majority)
>8< seem legit browser requests, 4 are google bot, >1 another bot<, 1 feedfetcher
-----Original Message----- From: Erik Zachte [mailto:ezachte@wikimedia.org] Sent: Thursday, February 14, 2013 12:40 AM To: 'A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.'; lars@aronsson.se Subject: RE: [Analytics] Fwd: [Wikitech-l] Page view stats we can believe in
Hi Erik,
> You're quite right that the numbers are inflated, and we've been over this before [1]. Below are some sampled data for da.wiktionary from webstatscollector [2] and the squid log [3]. Bot traffic is a substantial share of 'page views' (but not the majority as you suggest).
> We discussed this extensively in April and, as I remember (my mail archive is somehow incomplete), decided to implement a second cleaned-up stream without /bot/crawler/spider/http (keeping the original stream so as not to break trend lines).
> However, that bot-free stream (projectcounts files with an extra set of per-wiki totals) has not happened yet, and I'm pretty sure we have changed plans since and are probably now waiting for Kraken. Diederik, can you add to this?
Oh my, I thought this was in operation already. I've actually been looking at these page view stats, and now I feel like a fool.
Why not just remove these web pages at http://stats.wikimedia.org/wiktionary/EN/TablesPageViewsMonthly.htm since they contain only nonsense? Continuity with old nonsense is still nonsense, so remove everything now and start a new project with real numbers.
> [1] On April 8, 2012 you reported a similar issue for Swedish Wikipedia. I then checked one hour of the sampled squid log: 9 out of 13 requests were bots.
Nobody doubts that the Swedish Wikipedia has a substantial amount of human traffic. But for smaller projects, the background noise will dominate. If bots are 9 out of 13 requests to sv.wikipedia (really?), they can easily be 99% of traffic to da.wiktionary.
One easy way to tell would be to observe the daily rhythm. Since Swedish and Danish are limited to one timezone, traffic in the middle of the night should be much smaller than mid-day traffic. But bots could be operating all night, all day. So the least active hour is probably the background noise from bots.
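A rough sketch of this daily-rhythm idea: take the quietest hour as the bot floor, assume bots request pages at the same rate around the clock, and compare 24 times that floor with the day's total. The function name and the hourly numbers are hypothetical and for illustration only; they are not real da.wiktionary data.

```python
def estimate_bot_share(hourly_counts):
    """Estimate the bot share of one day's traffic from 24 hourly counts.

    Assumptions: bots request pages at a constant rate all day, and
    human traffic drops to roughly zero in the quietest hour.
    """
    if len(hourly_counts) != 24:
        raise ValueError("expected 24 hourly counts")
    total = sum(hourly_counts)
    if total == 0:
        return 0.0
    background = min(hourly_counts) * 24  # bot floor extended to the full day
    return background / total


# Made-up numbers: a quiet night, a busier day, a moderate evening.
hypothetical_day = [500] * 6 + [700] * 12 + [600] * 6
share = estimate_bot_share(hypothetical_day)
print(f"{share:.2f}")
```

If some people do read in the middle of the night, the quietest hour is not pure bot traffic, so this estimate would tend to overstate the bot share.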
Lars,
I think you are overdoing it. The reports are not nonsense, but have over time become more inaccurate than some other stats we present. Actually, if the reports had said 'pages served' rather than 'page views', they would still be spot on.
Of course, I too had hoped this filter would be implemented by now. But sometimes projects take longer than planned, at WMF like everywhere else.
The stats still show a breakdown per language, and relative growth, assuming bot activity is more or less consistent from one month to another (of course not over longer periods).
The last figure I got (in April?) is that overall 40% of traffic is bot-related. That could be more now.
Erik
-----Original Message----- From: Lars Aronsson [mailto:lars@aronsson.se] Sent: Thursday, February 14, 2013 1:28 AM To: Erik Zachte Cc: 'A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.'; Wikimedia developers Subject: Re: [Analytics] Fwd: [Wikitech-l] Page view stats we can believe in
On 02/14/2013 02:56 AM, Erik Zachte wrote:
> Lars,
> I think you are overdoing it. The reports are not nonsense, but have over time become more inaccurate than some other stats we present. Actually, if the reports had said 'pages served' rather than 'page views', they would still be spot on.
Noooo, nobody in the web business counts bot accesses. Pages, page views, are human page views. You need to filter out bots, API calls, and non-page fetches. The main Wikistats, counting articles and users, is very accurate, and these nonsense page view stats give Wikistats a bad name. Plus they are used by all the GLAM projects to show museums how much people view pictures from their museum, and now that's all fake and exaggeration. It's 2-3 years wasted. Please don't waste any more years or months of our time. We now have to go back to museums and apologize.
> The stats still show a breakdown per language,
No, that's exactly what fails. Wikistats indicates that Wiktionary has more page views than Wikisource, and I believed this, and it surprised me, but now I understand that we are counting bots that follow red links, and that is a sport Wiktionary will always win. Humans tend to read Wikisource, but bots are drawn to spend time in the link mazes of Wiktionary.
> and relative growth, assuming bot activity is more or less consistent from one month to another (of course not over longer periods).
> The last figure I got (in April?) is that overall 40% of traffic is bot-related. That could be more now.
And it's far more for smaller projects, and for link-intensive Wiktionary, and for those languages of Wikipedia that create articles by bots, such as Dutch, Swedish, Vietnamese and Volapük.
This bot-created article about a spider "has been viewed 12 times in the last 30 days", but only by bots?
http://nl.wikipedia.org/wiki/Acantheis_variatus http://stats.grok.se/nl/latest/Acantheis_variatus
Bots creating articles and bots reading them, what a joke! And they are creating articles about spiders!
Lars,
I can feel your pain. I also feel it from time to time when reality is at odds with expectations. What I should have done is put up a notice to explain this anomaly, to avoid confusion. My bad.
Like I said: we planned to implement a second data stream. In fact I heard there was a patch to do just that, but the server couldn't handle it, too much packet loss, so it was retracted.
Yeah, it gives wikistats a bad name. You're right. A chain is only as strong as its weakest link.
While we're at it: I recently discovered page counts per article do not include access to the mobile site. This is being discussed; it seems not so easy to fix. Bring out the torches.
Erik
-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Lars Aronsson Sent: Thursday, February 14, 2013 4:03 AM To: Erik Zachte Cc: 'A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.'; 'Wikimedia developers' Subject: Re: [Wikitech-l] [Analytics] Fwd: Page view stats we can believe in
[resent with one extra clarification and from WMF account]
fix: page counts per article -> view counts per article as collected in 'pagecounts' files
Lars,
I can feel your pain. I also feel it from time to time when reality is at odds with expectations. What I should have done is put up a notice to explain this anomaly, to avoid confusion. My bad.
Like I said: we planned to implement a second data stream. In fact I heard there was a patch to do just that, but the server couldn't handle it, too much packet loss, so it was retracted.
Yeah, it gives wikistats a bad name. You're right. A chain is only as strong as its weakest link.
While we're at it: I recently discovered view counts per article as collected in 'pagecounts' files do not include access to the mobile site. This is being discussed; it seems not so easy to fix. Bring out the torches.
Erik
Lars Aronsson, 14/02/2013 04:02:
> On 02/14/2013 02:56 AM, Erik Zachte wrote:
>> Lars,
>> I think you are overdoing it. The reports are not nonsense, but have over time become more inaccurate than some other stats we present. Actually, if the reports had said 'pages served' rather than 'page views', they would still be spot on.
> Noooo, nobody in the web business counts bot accesses. Pages, page views, are human page views. You need to filter out bots, API calls, and non-page fetches. The main Wikistats, counting articles and users, is very accurate, and these nonsense page view stats give Wikistats a bad name. Plus they are used by all the GLAM projects to show museums how much people view pictures from their museum, and now that's all fake and exaggeration. It's 2-3 years wasted. Please don't waste any more years or months of our time. We now have to go back to museums and apologize.
You're exaggerating a bit here, I think; the first thing we tell potential GLAM partners is that we don't have any way to give meaningful stats (and yes, this is often the main deal-breaker). The only stats they care about, anyway, are often visitors coming from Wikimedia projects, which they measure themselves. As for meaningful stats, we've been using comScore for a long while, and pageviews only as a rough measure of total reach growth and for comparison between pages on the same project, not really to compare different projects (or other websites). Indeed, comparing Wiktionary to Wikisource with this data makes no sense; thanks for reminding us.
> Bots creating articles and bots reading them, what a joke! And they are creating articles about spiders!
LOL sv.wiki is indeed becoming a bot realm. ;) Why are those bots not using the API, by the way?
Nemo