Hi everyone,
We're in the process of figuring out how to fix some of the issues in our logging infrastructure. I'm sending this email both to get the more knowledgeable folks to chime in about where I've got the details wrong and to invite general comment on how we're doing our logging. We may need to recruit contract developers to work on this stuff, so we want to make sure we have clear and accurate information available, and we need to figure out exactly what we want to direct those people to do.
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
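To make the "one out of every 1000 hits" idea concrete, here is a minimal sketch of counter-based sampling. It is only an illustration, not the code that actually runs on the collector; the function names and the choice of Python are assumptions of mine.

```python
from itertools import islice


def sample_every_nth(lines, n=1000):
    """Return an iterator over every n-th item of `lines` (1-in-n sampling).

    Hypothetical helper: the real collector may sample differently.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    # Start at index n-1 and step by n, so items n, 2n, 3n, ... are kept.
    return islice(lines, n - 1, None, n)


def estimate_total(sampled_count, n=1000):
    """Scale a count taken from the sample back up to an estimated total."""
    return sampled_count * n
```

Any count taken from the sampled log has to be multiplied by the sampling factor to estimate real traffic, which is why the quality of the sampling matters so much.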
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
At any rate, there are a couple of problems with the way that it works:
1. Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
2. We'd like to increase the granularity of logging so that we can do more sophisticated analysis. For example, if we decide to run a test banner to a limited audience, we need to make sure we're getting more complete logs for that audience or else we're not getting enough data to do any useful analysis.
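For the second problem, one possible shape of a fix is to vary the sampling factor per request instead of using a single global rate. The sketch below is just to frame the discussion: the "banner=test" URL marker, the constants, and the function names are all made up for illustration, and nothing like this exists in our code as far as I know.

```python
import random

DEFAULT_FACTOR = 1000  # keep roughly 1 in 1000 hits, as in the current setup
TEST_FACTOR = 1        # hypothetical: keep every hit from the test audience


def sampling_factor(url, test_marker="banner=test"):
    """Pick a sampling factor for a request (the marker is an assumption)."""
    return TEST_FACTOR if test_marker in url else DEFAULT_FACTOR


def should_log(url, rng=random.random):
    """Decide whether to log a hit, keeping it with probability 1/factor."""
    # rng() returns a float in [0, 1), so a factor of 1 always logs.
    return rng() < 1.0 / sampling_factor(url)
```

Whatever we do here, each log line would need to carry the factor it was sampled at, so that analysis can weight the hits correctly.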
If this were your typical commercial operation, the answer would be "why aren't you just logging into Streambase?" (or some other data warehousing solution). I'm not suggesting that we do that (or even look at any of the solutions that bill themselves as open source alternatives), since, while our needs are increasing, we still aren't planning to be anywhere near as sophisticated as a lot of data tracking orgs. Still, it's worth asking questions about our existing setup. Should we be looking to optimize our existing single-box setup, extend our software to support multi-node collection, or look at a whole new collection strategy?
Rob