Brion Vibber wrote:
Miguel Chaves wrote:
Hi, I wonder whether Wikipedia relies only on this sort of external statistics (like Alexa) to gather information about visits to the sites. Aren't there statistics collected on the Wikipedia servers themselves? That would be more useful and reliable.
Not at this time. At our traffic level, web server logs are too large to handle comfortably without a dedicated infrastructure, and we've been forced to simply disable them until something easier to handle gets set up.
(If we were an ad-supported site, such statistics would be much, much more important, and we'd have put in the time and money for it a lot sooner.)
BTW, if we want to know the popularity of a specific article (not a specific Wikipedia), is there a tool for that?
Not really, sorry.
-- brion vibber (brion @ pobox.com)
Since the traffic is so vast, why not use random sampling? At each page hit, call a random-number generator (e.g. read four bytes from /dev/urandom, or call a seeded pseudo-random number routine) and make a log entry only if the result is 0 mod 1000. That way the logs will be statistically representative, but require only a relatively tiny amount of disk I/O, compute time, and disk space.
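Something along these lines, as a minimal Python sketch of the sampling idea rather than anything MediaWiki actually runs (the log path and sample rate are just placeholders):

    import os
    import struct
    import time

    SAMPLE_RATE = 1000  # keep roughly 1 hit in 1000

    def maybe_log_hit(url, logfile="/var/log/sampled-hits.log"):
        # Read four bytes from the kernel entropy pool and interpret
        # them as an unsigned 32-bit integer.
        value = struct.unpack("<I", os.urandom(4))[0]
        if value % SAMPLE_RATE == 0:
            with open(logfile, "a") as f:
                f.write("%s %s\n" % (time.strftime("%Y-%m-%dT%H:%M:%S"), url))

Multiplying the sampled counts by 1000 then gives an unbiased estimate of the real totals, with error that shrinks as traffic grows.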
Alternatively, you could log using UDP syslog, and have a listener that threw away 999 out of 1000 packets.
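A rough sketch of that listener variant, again in Python with a hypothetical port and file name: bind a UDP socket where syslog forwards the hits and keep only every 1000th datagram.

    import socket

    SAMPLE_RATE = 1000

    def run_listener(host="0.0.0.0", port=5514, logfile="sampled-hits.log"):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind((host, port))
        seen = 0
        with open(logfile, "a") as out:
            while True:
                data, addr = sock.recvfrom(65535)
                seen += 1
                if seen % SAMPLE_RATE:      # throw away 999 out of 1000
                    continue
                out.write(data.decode("utf-8", "replace") + "\n")
                out.flush()

This moves the sampling decision off the web servers entirely, at the cost of shipping every hit over the network first.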
-- Neil