Rob,
I'm not completely sure whether you are talking about the same logging infrastructure that feeds our traffic stats at stats.grok.se [1]. However, having worked with those stats and the raw files provided by Domas [2], I am pretty sure that those squid traffic stats are intended to be a complete traffic sample (or nearly so), not a 1/1000 sample.
We have done various fractionated samples in the past, but I believe the squid logs used for traffic stats at the present time are not fractionated.
If you are talking about a different process of logging not associated with the traffic logs, then I apologize for my confusion.
-Robert Rohde
[1] http://stats.grok.se/
[2] http://dammit.lt/wikistats/
On Mon, Aug 9, 2010 at 10:16 PM, Rob Lanphier <robla@wikimedia.org> wrote:
Hi everyone,
We're in the process of figuring out how to fix some of the issues in our logging infrastructure. I'm sending this email both to get the more knowledgeable folks to chime in about where I've got the details wrong, and to solicit general comment on how we're doing our logging. We may need to recruit contract developers to work on this stuff, so we want to make sure we have clear and accurate information available, and we need to figure out exactly what we want to direct those people to do.
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
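For concreteness, here's a rough sketch (in Python) of what that single collection point does. This is illustrative only, not the actual udplog code; the port, output file, and counter-based (rather than random) selection are made-up assumptions:

    import socket

    # Sketch only: each squid sends one log line per request as a UDP
    # datagram to a single collector, which keeps roughly 1 in 1000.
    SAMPLE_RATE = 1000

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 8420))  # hypothetical listen port

    counter = 0
    with open("sampled.log", "a") as out:
        while True:
            datagram, _addr = sock.recvfrom(65535)  # one log line per datagram
            counter += 1
            if counter % SAMPLE_RATE == 0:  # keep every 1000th line
                out.write(datagram.decode("utf-8", "replace"))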
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
At any rate, there are a couple of problems with the way that it works:

1. Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.

2. We'd like to increase the granularity of logging so that we can do more sophisticated analysis. For example, if we decide to run a test banner for a limited audience, we need to make sure we're getting more complete logs for that audience, or else we're not getting enough data to do any useful analysis. (One possible shape for this is sketched after this list.)
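As a rough illustration of problem 2, a filter like the following would keep every hit from the test-banner audience while sampling the rest. The "testbanner=1" URL marker is a hypothetical example, not something we currently emit:

    import random

    SAMPLE_RATE = 1000  # 1/1000 for general traffic

    def keep(log_line: str) -> bool:
        # Keep every hit from the test-banner audience, sample the rest.
        # "testbanner=1" is a made-up campaign marker for illustration.
        if "testbanner=1" in log_line:
            return True
        return random.randrange(SAMPLE_RATE) == 0

    # Only the banner hit is guaranteed to be kept:
    lines = [
        "GET /wiki/Main_Page HTTP/1.1",
        "GET /wiki/Main_Page?testbanner=1 HTTP/1.1",
    ]
    kept = [line for line in lines if keep(line)]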
If this were your typical commercial operation, the answer would be "why aren't you just logging into Streambase?" (or some other data warehousing solution). I'm not suggesting that we do that (or even look at any of the solutions that bill themselves as open source alternatives), since, while our needs are increasing, we still aren't planning to be anywhere near as sophisticated as a lot of data tracking orgs. Still, it's worth asking questions about our existing setup. Should we be looking to optimize our existing single-box setup, extend our software to support multi-node collection, or pursue a whole new collection strategy?
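For the multi-node option, one shape worth considering (a sketch of an assumption on my part, not a proposal we've evaluated) is to shard the UDP stream across collectors by hashing the client IP, so the load splits across boxes and each client's hits land on a single collector:

    import hashlib

    # Hypothetical collector list; hashing the client IP splits the stream
    # so no single NIC has to absorb all of it, while keeping any one
    # client's hits on the same box.
    COLLECTORS = ["collector1:8420", "collector2:8420", "collector3:8420"]

    def collector_for(client_ip: str) -> str:
        digest = hashlib.md5(client_ip.encode("ascii")).digest()
        return COLLECTORS[digest[0] % len(COLLECTORS)]

    print(collector_for("208.80.152.2"))  # same IP always maps to the same box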
Rob