Hi Lars,
You have a point here, especially for smaller projects.
For Swedish Wikisource:
zcat sampled-1000.log-20120404.gz |
  grep 'GET http://sv.wikisource.org' |
  awk '{print $9, $11, $14}'
returns 20 lines from this 1:1000 sampled squid log file.
After removing JavaScript, JSON and robots.txt requests there are 13 left,
which, scaled up by 1000, fits well with 10,000 to 13,000 page views per day.
However, 9 of these are bots!
http://sv.wikisource.org/wiki/Snabbt_jagar_stormen_v%C3%A5ra_%C3%A5r,text/html,Mozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_2)%20AppleWebKit/535.19%20(KHTML,%20like%20Gecko)%20Chrome/18.0.1025.142%20Safari/535.19
http://sv.wikisource.org/wiki/Special:Log?page=User:Sarahmaethomas,text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Underbar-k%C3%A4rlek-s%C3%A5-stor,text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/w/index.php?title=Diskussion%3aBer%c3%a4ttelser+ur+svenska+historien%2fHedniska+tiden%2f2&redirect=no&action=raw&ctype=text/plain&dontcountme=s,text/x-wiki,DotNetWikiBot/2.97%20(Microsoft%20Windows%20NT%206.1.7601%20Service%20Pack%201;%20.NET%20CLR%202.0.50727.5448)
http://sv.wikisource.org/wiki/Sida:SOU_1962_36.djvu/36,-,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Till_Polen,-,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Bibeln_1917/F%C3%B6rsta_Moseboken,text/html,Mozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_3)%20AppleWebKit/534.53.11%20(KHTML,%20like%20Gecko)%20Version/5.1.3%20Safari/534.53.10
http://sv.wikisource.org/wiki/Arbetare,text/html,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
http://sv.wikisource.org/wiki/Industrin_och_kvinnofr,text/html,Mozilla/5.0%20(compatible;%20Baiduspider/2.0;%20+http://www.baidu.com/search/spider.html)
http://sv.wikisource.org/wiki/Sida:Berzelius_Reseanteckningar_1903.djvu/120,text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Kategori:Ordspr%C3%A5ksboken,text/html,Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20Win64;%20x64;%20Trident/5.0)
http://sv.wikisource.org/wiki/Special:L%C3%A4nkar_hit/Kategori:Karin_Boye,text/html,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
http://sv.wikisource.org/wiki/Sida:Om_arternas_uppkomst.djvu/235,-,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
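
The bot count can be reproduced from the user agent strings. The following is only a minimal sketch: it assumes input lines of the form url,mime,user-agent with no commas in the URL or MIME type, and the patterns are illustrative guesses, not the rules used by Wikistats or webstatscollector.

```shell
#!/bin/sh
# Tally requests by user-agent class. Input: "url,mime,user-agent" lines.
# Assumes the URL and MIME type contain no commas. The patterns are
# illustrative guesses, not the filter used by Wikistats or webstatscollector.
count_agents() {
  awk '
    {
      ua = $0
      sub(/^[^,]*,[^,]*,/, "", ua)   # drop the URL and MIME type fields
      ua = tolower(ua)
      if (ua ~ /googlebot|yandexbot|baiduspider/) { crawler++ }
      else if (ua ~ /bot|spider|crawl/)           { other_bot++ }
      else                                        { browser++ }
    }
    END { printf "crawler=%d other_bot=%d browser=%d\n", crawler, other_bot, browser }
  '
}

# Four of the requests listed above, one per line.
count_agents <<'EOF'
http://sv.wikisource.org/wiki/Snabbt_jagar_stormen_v%C3%A5ra_%C3%A5r,text/html,Mozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_2)%20AppleWebKit/535.19%20(KHTML,%20like%20Gecko)%20Chrome/18.0.1025.142%20Safari/535.19
http://sv.wikisource.org/wiki/Till_Polen,-,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/w/index.php?title=Diskussion%3aBer%c3%a4ttelser+ur+svenska+historien%2fHedniska+tiden%2f2&redirect=no&action=raw&ctype=text/plain&dontcountme=s,text/x-wiki,DotNetWikiBot/2.97%20(Microsoft%20Windows%20NT%206.1.7601%20Service%20Pack%201;%20.NET%20CLR%202.0.50727.5448)
http://sv.wikisource.org/wiki/Arbetare,text/html,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)
EOF
# prints: crawler=2 other_bot=1 browser=1
```

Applied to all 13 requests listed above, rules like these give 9 search engine crawlers (5 Googlebot, 3 YandexBot, 1 Baiduspider), 1 other bot (DotNetWikiBot) and 3 browsers.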
The page view report
http://stats.wikimedia.org/wikisource/EN/TablesPageViewsMonthlyOriginal.htm
is based on
http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-04/
collected by
http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/
By sheer coincidence, last week we were discussing filtering bots from the
projectcounts files (at the source: webstatscollector).
This is not a new discussion, but with current resources it may be feasible
now, where it wasn't years ago.
Erik
-----Original Message-----
From: Lars Aronsson [mailto:lars@aronsson.se]
Sent: Sunday, April 08, 2012 2:24 AM
To: Wikimedia developers
Cc: Erik Zachte
Subject: Page views
I'm telling people that the Swedish Wikipedia has 90-100 million page views
per month or on average ten per month per Swedish citizen. This is based on
stats.wikimedia.org (Wikistats), but is it really true? It would be really
embarrassing if it were wrong by some order of magnitude.
There is of course a difference between the language and the country.
Another measure says Internet users in Sweden (some 90 percent of all
citizens) make 16 page views to Wikipedia per month, including all
languages. Both numbers, 10 and 16, make sense. But are they correct?
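
As a rough sanity check of the two figures, the arithmetic can be sketched as follows. The population of about 9.5 million is an assumption that is not stated in this thread, and 95 million is simply the midpoint of the 90-100 million range.

```shell
#!/bin/sh
# Rough per-capita check. POPULATION is an assumption (about 9.5 million
# Swedish citizens in 2012); it does not come from Wikistats.
POPULATION=9500000

# Monthly page views divided by population.
views_per_citizen() {
  awk -v views="$1" -v pop="$POPULATION" 'BEGIN { printf "%.1f\n", views / pop }'
}

# Views per Internet user, scaled to the whole population by the share of
# citizens who are online.
views_per_citizen_from_users() {
  awk -v per_user="$1" -v share="$2" 'BEGIN { printf "%.1f\n", per_user * share }'
}

views_per_citizen 95000000            # prints 10.0 (midpoint of 90-100 million)
views_per_citizen_from_users 16 0.9   # prints 14.4 (16 views per user, 90 percent online)
```

The two measures count different things (one language versus one country), so the check only shows that they are of the same order of magnitude.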
Wikistats also says Swedish Wikisource has 300-400 thousand page views per
month, which would be 10-13 thousand per day on average. Knowing how small
the Swedish Wikisource is (only 16,000 wiki pages + 37,000 facsimile pages),
and comparing to other Swedish
language websites, I'm surprised that Swedish Wikisource could attract even
this much traffic.
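
The conversion between monthly and daily figures, and the scaling of a 1:1000 sampled log, can be sketched as below. A 30-day month is assumed, and the function names are made up for this illustration.

```shell
#!/bin/sh
# Convert a monthly page-view total to a daily average.
# Assumes a 30-day month; integer division drops the fraction.
daily_average() {
  echo $(( $1 / 30 ))
}

# Scale a line count from a 1:1000 sampled log to an estimated total.
scale_sample() {
  echo $(( $1 * 1000 ))
}

daily_average 300000   # prints 10000 (lower Wikistats figure)
daily_average 400000   # prints 13333 (upper Wikistats figure)
scale_sample 13        # prints 13000 (13 sampled page requests in one day)
```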
Now we're at such a small scale that reading through a day's logfile with
13,000 lines is realistic for a human.
Is there a chance WMF could publish the logfile for Swedish Wikisource for a
typical day, with just the IP addresses anonymized, plus the source code
that counts the number of page views by filtering out accesses from robot
crawlers and accesses to non-pages (like images and style sheets)?
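
A filter for non-page requests might look roughly like the sketch below. It assumes input lines of the form url,mime,user-agent with no commas in the URL or MIME type. The exclusion rules and the second and third sample lines are hypothetical illustrations, not the actual Wikistats rules or real log entries.

```shell
#!/bin/sh
# Keep only lines that look like page requests. Input: "url,mime,user-agent".
# The exclusion rules are assumptions for illustration, not the Wikistats rules.
filter_pages() {
  awk -F',' '
    $1 ~ /\/robots\.txt$/      { next }   # crawler policy file
    $2 ~ /javascript|json|css/ { next }   # scripts, API responses, style sheets
    $2 ~ /^image\//            { next }   # images
    { print }
  '
}

# First line is from the thread; the other two are made-up non-page requests.
filter_pages <<'EOF'
http://sv.wikisource.org/wiki/Till_Polen,-,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/robots.txt,text/plain,ExampleCrawler/1.0
http://sv.wikisource.org/w/load.php,text/javascript,ExampleBrowser/1.0
EOF
# prints only the Till_Polen line
```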
Page views for individual pages (on stats.grok.se) show that the Main page of
Swedish Wikisource is viewed 120 times/day while Recent changes is viewed 160
times/day. From my own experience, contributors are the only ones to look at
Recent changes, while they almost never look at the Main page. If IP
addresses are scrambled but not removed, the log file should be able to show
this pattern. Is it possible to tell apart the IP addresses for contributors
and non-contributors, and present page views from each category?
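
One way to scramble addresses without removing them is a salted hash, sketched below. The same address always maps to the same token, so per-visitor patterns survive. The salt value and function name are hypothetical, and sha256sum from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Replace an IP address with a stable pseudonym: the same address always maps
# to the same token, so per-visitor patterns remain visible in the log.
# SALT is a hypothetical secret that must stay private; sha256sum (GNU
# coreutils) is assumed to be installed.
SALT='replace-with-a-private-random-value'

scramble_ip() {
  printf '%s%s' "$SALT" "$1" | sha256sum | cut -c1-16
}

# 192.0.2.x addresses are reserved for documentation (RFC 5737).
scramble_ip 192.0.2.10
scramble_ip 192.0.2.10   # same token as the line above
scramble_ip 192.0.2.11   # different token
```

Because the IPv4 address space is small, the tokens can be reversed by brute force if the salt leaks, so this is pseudonymization rather than full anonymization.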
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se