[Wikitech-l] Wikimedia logging infrastructure

10 Aug 2010


Hi everyone,
We're in the process of figuring out how we fix some of the issues in
our logging infrastructure.  I'm both sending this email out to get
the more knowledgeable folks to chime in about where I've got the
details wrong, and for general comment on how we're doing our logging.
 We may need to recruit contract developers to work on this stuff, so
we want to make sure we have clear and accurate information available,
and we need to figure out what exactly we want to direct those people
to do.
We have a single collection point for all of our logging, which is
actually just a sampling of the overall traffic (designed to be
roughly one out of every 1000 hits).  The process is described here:
http://wikitech.wikimedia.org/view/Squid_logging
My understanding is that this code is also involved somewhere:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/
...but I'm a little unclear on what the relationship is between that
code and the code in trunk/udplog.
At any rate, there are a couple of problems with the way that it works:
1.  Once we saturate the NIC on the logging machine, the quality of
our sampling degrades pretty rapidly.  We've generally had a problem
with that over the past few months.
2.  We'd like to increase the granularity of logging so that we can do
more sophisticated analysis.  For example, if we decide to run a test
banner to a limited audience, we need to make sure we're getting more
complete logs for that audience or else we're not getting enough data
to do any useful analysis.
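To make the second problem concrete, here is a minimal sketch of
sampling with a per-audience override.  It is not the udplog code; the
function name, the "tag" field and the "banner_test" value are all
made up for illustration.  It keeps roughly one hit in every 1000, but
keeps every hit that carries the test audience's tag.

```python
def sample_hits(hits, base_rate=1000, full_log_tag=None):
    """Yield a 1-in-base_rate sample of hits, plus every tagged hit.

    `hits` is any iterable of dicts; the "tag" key is a hypothetical
    marker for a test audience (for example, a limited banner run).
    """
    for index, hit in enumerate(hits):
        if full_log_tag is not None and hit.get("tag") == full_log_tag:
            yield hit  # targeted audience: log every hit
        elif index % base_rate == 0:
            yield hit  # everyone else: keep one hit per base_rate


# Synthetic traffic: 10,000 hits, 20 of which belong to the test audience.
hits = [
    {"url": f"/wiki/Page_{i}", "tag": "banner_test" if i % 500 == 7 else None}
    for i in range(10_000)
]

baseline = list(sample_hits(hits))
targeted = list(sample_hits(hits, full_log_tag="banner_test"))
print(len(baseline), len(targeted))
```

With plain 1-in-1000 sampling, the 20 test-audience hits in this
synthetic traffic are dropped entirely; with the override, all of them
are kept alongside the baseline sample.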
If this were your typical commercial operation, the answer would be
"why aren't you just logging into Streambase?" (or some other data
warehousing storage solution).  I'm not suggesting that we do that (or
even look at any of the solutions that bill themselves as open source
alternatives), since, while our needs are increasing, we still aren't
planning to be anywhere near as sophisticated as a lot of data
tracking orgs.  Still, it's worth asking questions about our existing
setup.  Should we be optimizing our existing single-box setup,
extending our software to have multi-node collection, or looking at a
whole new collection strategy?
Rob
