Hi Lars,
You have a point here, especially for smaller projects:
For Swedish Wikisource:
zcat sampled-1000.log-20120404.gz | grep 'GET http://sv.wikisource.org' | awk '{print $9, $11,$14}'
returns 20 lines from this 1:1000 sampled squid log file. After removing JavaScript/JSON/robots.txt requests there are 13 left, which fits perfectly with 10,000 to 13,000 per day.
However, 9 of these are bots!
http://sv.wikisource.org/wiki/Snabbt_jagar_stormen_v%C3%A5ra_%C3%A5r,text/html,Mozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_2)%20AppleWebKit/535.19%20(KHTML,%20like%20Gecko)%20Chrome/18.0.1025.142%20Safari/535.19
http://sv.wikisource.org/wiki/Special:Log?page=User:Sarahmaethomas,text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Underbar-k%C3%A4rlek-s%C3%A5-stor,text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/w/index.php?title=Diskussion%3aBer%c3%a4ttelser+ur+svenska+historien%2fHedniska+tiden%2f2&redirect=no&action=raw&ctype=text/plain&dontcountme=s,text/x-wiki,DotNetWikiBot/2.97%20(Microsoft%20Windows%20NT%206.1.7601%20Service%20Pack%201;%20.NET%20CLR%202.0.50727.5448)
http://sv.wikisource.org/wiki/Sida:SOU_1962_36.djvu/36,-,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Till_Polen,-,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Bibeln_1917/F%C3%B6rsta_Moseboken,text/html,Mozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_3)%20AppleWebKit/534.53.11%20(KHTML,%20like%20Gecko)%20Version/5.1.3%20Safari/534.53.10
http://sv.wikisource.org/wiki/Arbetare,text/html,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
http://sv.wikisource.org/wiki/Industrin_och_kvinnofr,text/html,Mozilla/5.0%20(compatible;%20Baiduspider/2.0;%20+http://www.baidu.com/search/spider.html)
http://sv.wikisource.org/wiki/Sida:Berzelius_Reseanteckningar_1903.djvu/120,text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Kategori:Ordspr%C3%A5ksboken,text/html,Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20Win64;%20x64;%20Trident/5.0)
http://sv.wikisource.org/wiki/Special:L%C3%A4nkar_hit/Kategori:Karin_Boye,text/html,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
http://sv.wikisource.org/wiki/Sida:Om_arternas_uppkomst.djvu/235,-,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
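To make the bot/non-bot split repeatable, here is a minimal Python sketch of the kind of check I did by eye. It is only an illustration: the marker list (`BOT_MARKERS`) and the helper names `is_bot` and `estimate_daily_views` are my own assumptions, not what webstatscollector does, and a substring match like this will miss crawlers that do not identify themselves.

```python
from urllib.parse import unquote

SAMPLING_FACTOR = 1000  # the squid log above is sampled 1:1000

# Assumed heuristic: substrings that commonly appear in crawler user agents.
BOT_MARKERS = ("bot", "spider", "crawler")


def is_bot(user_agent: str) -> bool:
    """Return True if the (percent-encoded) user agent looks like a crawler."""
    decoded = unquote(user_agent).lower()
    return any(marker in decoded for marker in BOT_MARKERS)


def estimate_daily_views(user_agents, sampling_factor=SAMPLING_FACTOR):
    """Scale sampled request counts up to estimated daily totals."""
    bot_hits = sum(1 for ua in user_agents if is_bot(ua))
    other_hits = len(user_agents) - bot_hits
    return {
        "bots": bot_hits * sampling_factor,
        "other": other_hits * sampling_factor,
    }


# A few user agents copied from the listing above (not the full sample).
sample = [
    "Mozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_2)%20AppleWebKit/535.19"
    "%20(KHTML,%20like%20Gecko)%20Chrome/18.0.1025.142%20Safari/535.19",
    "Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)",
    "Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)",
    "Mozilla/5.0%20(compatible;%20Baiduspider/2.0;"
    "%20+http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20Win64;"
    "%20x64;%20Trident/5.0)",
]

print(estimate_daily_views(sample))
```

Each sampled line stands for roughly 1000 real requests, so at this scale a single misclassified line shifts the daily estimate by about a thousand views.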
The page view report http://stats.wikimedia.org/wikisource/EN/TablesPageViewsMonthlyOriginal.htm is based on http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-04/, which is collected by http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/
By sheer coincidence, last week we were discussing filtering bots out of the projectcounts files (at the source: webstatscollector). This is not a new discussion, but with current resources it may be feasible now, where it wasn't years ago.
Erik
-----Original Message-----
From: Lars Aronsson [mailto:lars@aronsson.se]
Sent: Sunday, April 08, 2012 2:24 AM
To: Wikimedia developers
Cc: Erik Zachte
Subject: Page views
I'm telling people that the Swedish Wikipedia has 90-100 million page views per month or on average ten per month per Swedish citizen. This is based on stats.wikimedia.org (Wikistats), but is it really true? It would be really embarrassing if it were wrong by some order of magnitude.
There is of course a difference between the language and the country. Another measure says Internet users in Sweden (some 90 percent of all citizens) make 16 page views to Wikipedia per month, including all languages. Both numbers 10 or 16 make sense. But are they correct?
Wikistats also says Swedish Wikisource has 300-400 thousand page views per month, which would be 10-13 thousand per day on average. Knowing how small the Swedish Wikisource is (only 16,000 wiki pages + 37,000 facsimile pages), and comparing to other Swedish language websites, I'm surprised that Swedish Wikisource could attract even this much traffic. Now we're at such a small scale, that reading through a day's logfile with 13,000 lines is realistic for a human.
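The monthly-to-daily conversion above can be checked with a few lines of Python. This is only a sanity-check sketch: the 30-day month and the helper names are assumptions made for illustration, and the 1:1000 factor is the sampling rate of the squid log mentioned in the reply.

```python
DAYS_PER_MONTH = 30     # assumption: a 30-day month
SAMPLING_FACTOR = 1000  # 1:1000 sampled squid log


def per_day(monthly_views, days=DAYS_PER_MONTH):
    """Average page views per day for a given monthly total."""
    return monthly_views / days


def expected_sampled_lines(monthly_views, sampling_factor=SAMPLING_FACTOR):
    """Lines one would expect per day in a 1:sampling_factor sampled log."""
    return per_day(monthly_views) / sampling_factor


for monthly in (300_000, 400_000):
    print(monthly, round(per_day(monthly)), round(expected_sampled_lines(monthly), 1))
```

So 300-400 thousand views per month corresponds to roughly 10-13 lines per day in a 1:1000 sample.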
Is there a chance WMF could publish the logfile for Swedish Wikisource for a typical day, with just the IP addresses anonymized? Plus the source code that counts the number of page views, by filtering out accesses from robot crawlers and accesses to non-pages (like images and style sheets).
Page views for individual pages (on stats.grok.se) show that the Main page of Swedish Wikisource is viewed 120 times/day while Recent changes is viewed 160 times/day. From my own experience, contributors are the only ones to look at Recent changes, while they almost never look at the Main page. If IP addresses are scrambled but not removed, the log file should be able to show this pattern. Is it possible to tell apart the IP addresses for contributors and non-contributors, and present page views from each category?
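As an illustration of what "scrambled but not removed" could mean, here is a minimal Python sketch that replaces each IP address with a keyed hash, so the same visitor keeps the same token within one published log while the address itself is not shown. The function names, the token length, and the sample requests (which use reserved documentation addresses) are all hypothetical, and this says nothing about how WMF actually handles its logs.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict


def make_scrambler(key: bytes):
    """Return a function mapping an IP address to a stable opaque token."""
    def scramble(ip: str) -> str:
        digest = hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()
        return digest[:16]  # shortened only for readability
    return scramble


# A fresh secret key per published log; discarding it afterwards prevents
# linking tokens across different logs.
scramble = make_scrambler(secrets.token_bytes(32))

# Hypothetical (ip, page) pairs standing in for parsed log lines.
requests = [
    ("192.0.2.1", "Special:Recent_changes"),
    ("192.0.2.1", "Special:Recent_changes"),
    ("192.0.2.7", "Main_page"),
]

# Group pages by scrambled visitor token.
pages_by_visitor = defaultdict(list)
for ip, page in requests:
    pages_by_visitor[scramble(ip)].append(page)

for token, pages in pages_by_visitor.items():
    print(token, pages)
```

Grouping by token would show which visitors return to Recent changes, but separating contributors from non-contributors would still need some other signal, for example matching tokens against edit requests in the same log.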