I'm telling people that the Swedish Wikipedia has 90-100 million page views per month, or on average ten per month per Swedish citizen. This is based on stats.wikimedia.org (Wikistats), but is it really true? It would be really embarrassing if it were wrong by an order of magnitude.
There is of course a difference between the language and the country. Another measure says Internet users in Sweden (some 90 percent of all citizens) make 16 page views to Wikipedia per month, including all languages. Both numbers, 10 and 16, make sense. But are they correct?
Wikistats also says Swedish Wikisource has 300-400 thousand page views per month, which would be 10-13 thousand per day on average. Knowing how small the Swedish Wikisource is (only 16,000 wiki pages + 37,000 facsimile pages), and comparing to other Swedish-language websites, I'm surprised that Swedish Wikisource could attract even this much traffic. We're now at such a small scale that reading through a day's logfile of 13,000 lines is realistic for a human.
Is there a chance WMF could publish the logfile for Swedish Wikisource for a typical day, with just the IP addresses anonymized? And also the source code that counts the number of page views by filtering out accesses from robot crawlers and accesses to non-pages (like images and style sheets)?
Page views for individual pages (on stats.grok.se) show that the Main page of Swedish Wikisource is viewed 120 times/day while Recent changes is viewed 160 times/day. From my own experience, contributors are the only ones to look at Recent changes, while they almost never look at the Main page. If IP addresses are scrambled but not removed, the log file should be able to show this pattern. Is it possible to tell apart the IP addresses of contributors and non-contributors, and present page views for each category?
Hi Lars,
You have a point here, especially for smaller projects:
For Swedish Wikisource:
zcat sampled-1000.log-20120404.gz | grep 'GET http://sv.wikisource.org' | awk '{print $9, $11,$14}'
returns 20 lines from this 1:1000 sampled squid log file. After removing javascript/json/robots.txt requests, 13 are left, which fits perfectly with 10,000 to 13,000 per day.
however 9 of these are bots!!
http://sv.wikisource.org/wiki/Snabbt_jagar_stormen_v%C3%A5ra_%C3%A5r,text/html,Mozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_2)%20AppleWebKit/535.19%20(KHTML,%20like%20Gecko)%20Chrome/18.0.1025.142%20Safari/535.19
http://sv.wikisource.org/wiki/Special:Log?page=User:Sarahmaethomas,text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Underbar-k%C3%A4rlek-s%C3%A5-stor,text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/w/index.php?title=Diskussion%3aBer%c3%a4ttelser+ur+svenska+historien%2fHedniska+tiden%2f2&redirect=no&action=raw&ctype=text/plain&dontcountme=s,text/x-wiki,DotNetWikiBot/2.97%20(Microsoft%20Windows%20NT%206.1.7601%20Service%20Pack%201;%20.NET%20CLR%202.0.50727.5448)
http://sv.wikisource.org/wiki/Sida:SOU_1962_36.djvu/36,-,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Till_Polen,-,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Bibeln_1917/F%C3%B6rsta_Moseboken,text/html,Mozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_3)%20AppleWebKit/534.53.11%20(KHTML,%20like%20Gecko)%20Version/5.1.3%20Safari/534.53.10
http://sv.wikisource.org/wiki/Arbetare,text/html,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
http://sv.wikisource.org/wiki/Industrin_och_kvinnofr,text/html,Mozilla/5.0%20(compatible;%20Baiduspider/2.0;%20+http://www.baidu.com/search/spider.html)
http://sv.wikisource.org/wiki/Sida:Berzelius_Reseanteckningar_1903.djvu/120,text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Kategori:Ordspr%C3%A5ksboken,text/html,Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20Win64;%20x64;%20Trident/5.0)
http://sv.wikisource.org/wiki/Special:L%C3%A4nkar_hit/Kategori:Karin_Boye,text/html,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
http://sv.wikisource.org/wiki/Sida:Om_arternas_uppkomst.djvu/235,-,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
The page view report http://stats.wikimedia.org/wikisource/EN/TablesPageViewsMonthlyOriginal.htm is based on http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-04/ collected by http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/
By sheer coincidence we were discussing filtering bots from the projectcounts files (at the source: webstatscollector) just last week. Not a new discussion, but with current resources this may be feasible now, whereas it wasn't years ago.
Erik
2012/4/8 Erik Zachte ezachte@wikimedia.org
however 9 of these are bots!!
How many of that whole 1:1000 sampled log were robots (including all languages)?

-- Emilio J. Rodríguez-Posada
Here are some numbers on total bot burden:
1) http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm states for March 2012:
In total 69.5 M page requests (mime type text/html only!) per day are considered crawler requests, out of 696 M page requests (10.0%) or 469 M external page requests (14.8%). About half (35.1 M) of crawler requests come from Google.
2) Here are counts from one day's log, as a sanity check:
zcat sampled-1000.log-20120404.gz | awk '{print $9, $11, $14}' | grep -P '/wiki/|index.php' | grep -cP ' - |text/html' => 678325
zcat sampled-1000.log-20120404.gz | awk '{print $9, $11, $14}' | grep -P '/wiki/|index.php' | grep -P ' - |text/html' | grep -ciP 'bot|crawler|spider' => 68027
68027 / 678325 = 10.0%, which matches really well with the numbers from SquidReportCrawlers.htm
---
My suggestion for how to filter these bots efficiently in a C program (no costly nuanced regexps) before sending data to webstatscollector:
a) Find the 14th field in the space-delimited log line = user agent (but beware of false delimiters in logs from varnish, if still applicable)
b) Search this field case-insensitively for bot/crawler/spider/http (by convention only bots have a url in the agent string)
That will filter out most bot pollution. We still want those records in sampled log though.
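In code the check could be roughly this (an untested sketch, not what webstatscollector currently does; the stdin/stdout plumbing, the buffer sizes and the "keep short lines" behaviour are just assumptions):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* case-insensitive substring test, so we match Googlebot, YandexBot, bot, ... */
static int contains_nocase(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);
    for (; *haystack != '\0'; haystack++) {
        size_t i;
        for (i = 0; i < n; i++)
            if (tolower((unsigned char)haystack[i]) != tolower((unsigned char)needle[i]))
                break;
        if (i == n)
            return 1;
    }
    return 0;
}

/* Return 1 if the line looks like a crawler request, judging only by the
   user agent in the 14th space-delimited field. */
static int is_bot_request(const char *line)
{
    const char *p = line;
    int field = 1;

    while (field < 14) {                 /* walk to the start of field 14 */
        p = strchr(p, ' ');
        if (p == NULL)
            return 0;                    /* short/odd line: keep it */
        p++;
        field++;
    }

    char agent[1024];                    /* copy just the agent field */
    size_t len = strcspn(p, " \n");
    if (len >= sizeof agent)
        len = sizeof agent - 1;
    memcpy(agent, p, len);
    agent[len] = '\0';

    /* by convention only bots put a url (hence "http") in the agent string */
    return contains_nocase(agent, "bot")
        || contains_nocase(agent, "crawler")
        || contains_nocase(agent, "spider")
        || contains_nocase(agent, "http");
}

int main(void)
{
    char line[8192];
    while (fgets(line, sizeof line, stdin) != NULL)
        if (!is_bot_request(line))
            fputs(line, stdout);         /* pass human traffic through */
    return 0;
}

Copying only field 14 before matching is what keeps "http" usable as a marker; searched over the full line it would hit the request url every time.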
Any thoughts?
Erik Zachte
On 04/11/2012 01:45 AM, Erik Zachte wrote:
Here are some numbers on total bot burden:
http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm states for March 2012:
In total 69.5 M page requests (mime type text/html only!) per day are considered crawler requests, out of 696 M page requests (10.0%) or 469 M external page requests (14.8%). About half (35.1 M) of crawler requests come from Google.
The fraction will be larger than average (larger than 10%) for a) sites with many small pages (Wiktionary) and b) sites in languages with a smaller audience (Swedish sites). Bots will index these pages as they are found, but each of these pages can expect fewer search hits and less human traffic than long articles (Wikipedia) in languages with many speakers (English). The bot traffic is like a constant background noise, and the human traffic is the signal on top. Sites with many small pages and a small audience will have a lower signal-to-noise ratio. The long tail of seldom visited pages is drowning in that noise.
I should disclose that I "work for the competition". I tried to add books to Wikisource, but its complexity slows me down, so I'm now focusing on my own Scandinavian book-scanning website, Project Runeberg, http://runeberg.org/
It has 700,000 scanned book pages, the same size as the English Wikisource, which is a large number of pages for a small language audience (mostly Swedish). Yesterday, April 10, its Apache access log had 291,000 hits, of which 116,000 were HTML pages, but 71,000 of those matched bot/spider/crawler, leaving only 45,000 human page views. I'd be surprised if Swedish Wikisource, which is 1/20 that size, got 10-13 thousand human page views per day, or 1/4 of that web traffic. It is more likely that 71/116 = 61% of it is bot traffic.
(Are we competitors? Really not. We're both liberating content. Swedish Wikipedia has more external links to runeberg.org than to any other website.)
My suggestion for how to filter these bots efficiently in c program (no costly nuanced regexps) before sending data to webstatscollector:
a) Find the 14th field in the space-delimited log line = user agent (but beware of false delimiters in logs from varnish, if still applicable)
b) Search this field case-insensitively for bot/crawler/spider/http (by convention only bots have a url in the agent string)
That will filter out most bot pollution. We still want those records in sampled log though.
Any thoughts?
I did some research on fast string matching and it seems that the recently developed algorithm by Leonid Volnitsky is very fast (http://volnitsky.com/project/str_search/index.html). I will do some benchmarks vs the ordinary C strstr function, but the author claims it's 20x faster.
So instead of hard-coding where the bot information should be, just search the entire log line for the bot markers; if they are present, discard the line, otherwise process it as-is.
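A minimal sketch of that whole-line variant (untested; strcasestr() is a GNU/BSD extension, and "http" has to be dropped from the marker list here because every request url contains it):

#define _GNU_SOURCE                      /* for strcasestr() */
#include <string.h>

/* crude whole-line test: no field parsing at all */
static int looks_like_bot(const char *line)
{
    /* note: can also match urls or page titles that happen to contain these words */
    return strcasestr(line, "bot")     != NULL
        || strcasestr(line, "crawler") != NULL
        || strcasestr(line, "spider")  != NULL;
}

The benchmark question would then be whether this (or Volnitsky's search) over the whole line beats locating field 14 first.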
Best, Diederik
On Mon, Apr 9, 2012 at 00:46, Erik Zachte ezachte@wikimedia.org wrote:
however 9 of these are bots!!
Is this the same case for the mobile stats as well? I don't think there could be a sudden 100% growth for 2 months now across wikis [1] without some reason like this.
[1] http://stats.wikimedia.org/EN_India/TablesPageViewsMonthlyMobile.htm
-- Regards Srikanth.L
Hi Srikanth,
Yes, we are looking into the growth percentages as they seem unrealistically high. Best, Diederik
For a few weeks now Googlebot has been crawling the mobile site (on purpose: they want to know what mobile-friendly content is on the web, even though it is similar to our main site).
This makes all the difference.
Here is a random example of how our traffic on smaller Wikipedias changed. http://stats.wikimedia.org/wikimedia/misc/MobileTrafficCA.png (based on Domas' hourly projectcounts files)
The peak occurred on 8 March at 2 AM; we find 46957 page views for that one hour. In the 1:1000 sampled squid log we should find ~47 of those.
"zcat sampled-1000.log-20120308.gz | grep ca.m.wikipedia.org" yields 339 records for the whole day; most are Googlebot.
(some more numbers on bot share of total traffic tomorrow)
Erik Zachte