Brion Vibber wrote:
> Miguel Chaves wrote:
>> Hi, I wonder if Wikipedia only relies on this sort of external
>> statistics (like Alexa) to gather information about visits to the
>> sites. Aren't there statistics collected on the Wikipedia servers
>> themselves? That would be more useful and reliable.
> Not at this time. At our traffic level, web server logs are too large to handle
> comfortably without a dedicated infrastructure, and we've been forced to simply
> disable them until something easier to handle gets set up.
> (If we were an ad-supported site, such statistics would be much much more
> important and we'd have put in the time and money for it a lot sooner.)
>> BTW, if we want to know the popularity of a specific article (not a
>> specific wikipedia), is there a tool for that?
> Not really, sorry.
> -- brion vibber (brion @ pobox.com)
Since the traffic is so vast, why not use random sampling? On each page
hit, call a random-number generator (e.g. read four bytes from
/dev/urandom, or call a seeded pseudo-random number routine), and make a
log entry only if the result is 0 mod 1000. That way, the logs will be
statistically representative, but will require only a relatively tiny
amount of disk I/O, compute time, and disk space.
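A minimal Python sketch of that sampling idea. The function names here
(should_log, handle_hit) are illustrative, not anything in MediaWiki; the
point is just the 1-in-1000 decision made from four bytes of /dev/urandom:

```python
import os

SAMPLE_RATE = 1000  # log roughly 1 hit in 1000

def should_log() -> bool:
    """Decide whether to log this page hit.

    Reads four bytes from the OS entropy source (/dev/urandom on Linux)
    and logs only when the value is 0 mod SAMPLE_RATE.
    """
    value = int.from_bytes(os.urandom(4), "big")
    return value % SAMPLE_RATE == 0

def handle_hit(url: str, log: list) -> None:
    """Hypothetical per-request hook: append a sampled log entry."""
    if should_log():
        log.append(url)  # a real logger would write a line to disk
```

To recover estimated totals from such a log, multiply each sampled count
by SAMPLE_RATE; with four random bytes (about 4.3 billion values) the
mod-1000 bias is negligible.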
Alternatively, you could log using UDP syslog, and have a listener that
threw away 999 out of 1000 packets.
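The listener variant could look something like the sketch below. This is
an assumption about how one might structure it, not an existing tool: the
sampling filter is split out so it can be tested apart from the socket,
and the port number (5140) is arbitrary (classic syslog uses UDP 514).

```python
import random
import socket
from typing import Iterable, Iterator

KEEP_ONE_IN = 1000  # throw away 999 out of 1000 packets

def sample(packets: Iterable[bytes], rng: random.Random) -> Iterator[bytes]:
    """Yield roughly one packet in KEEP_ONE_IN, discarding the rest."""
    for packet in packets:
        if rng.randrange(KEEP_ONE_IN) == 0:
            yield packet

def udp_packets(host: str = "0.0.0.0", port: int = 5140) -> Iterator[bytes]:
    """Receive UDP syslog datagrams forever and yield them one by one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        datagram, _addr = sock.recvfrom(65535)
        yield datagram

# A real listener would then do something like:
#   for packet in sample(udp_packets(), random.Random()):
#       logfile.write(packet + b"\n")
```

Since UDP is lossy anyway, dropping packets in the listener costs nothing
extra in reliability, and the web servers never block on logging.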