Hi everyone,
We're in the process of figuring out how to fix some of the issues in our logging infrastructure. I'm sending this email out both to get the more knowledgeable folks to chime in on where I've got the details wrong, and to get general comment on how we're doing our logging. We may need to recruit contract developers to work on this stuff, so we want to make sure we have clear and accurate information available, and we need to figure out what exactly we want to direct those people to do.
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
At any rate, there are a couple of problems with the way that it works:
1. Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
2. We'd like to increase the granularity of logging so that we can do more sophisticated analysis. For example, if we decide to run a test banner to a limited audience, we need to make sure we're getting more complete logs for that audience, or else we're not getting enough data to do any useful analysis.
If this were your typical commercial operation, the answer would be "why aren't you just logging into Streambase?" (or some other data warehousing storage solution). I'm not suggesting that we do that (or even look at any of the solutions that bill themselves as open source alternatives), since, while our needs are increasing, we still aren't planning to be anywhere near as sophisticated as a lot of data tracking orgs. Still, it's worth asking questions about our existing setup. Should we be looking to optimize our existing single-box setup, extending our software to have multi-node collection, or looking at a whole new collection strategy?
Rob
On 10/08/10 15:16, Rob Lanphier wrote:
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
Maybe you should find out who wrote the relevant code and set up the relevant infrastructure, and ask them directly. It's not difficult to find out who it was.
At any rate, there are a couple of problems with the way that it works:
- Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
We haven't saturated any NICs.
-- Tim Starling
On Mon, Aug 9, 2010 at 11:17 PM, Tim Starling tstarling@wikimedia.org wrote:
On 10/08/10 15:16, Rob Lanphier wrote:
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
Maybe you should find out who wrote the relevant code and set up the relevant infrastructure, and ask them directly. It's not difficult to find out who it was.
Well, yes, I was hoping you'd weigh in on this thread.
At any rate, there are a couple of problems with the way that it works:
- Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
We haven't saturated any NICs.
Sorry, I assumed it was a NIC. There has been packet loss, from what I understand. I'll leave it at that.
Rob
Rob,
I'm not completely sure whether or not you are talking about the same logging infrastructure that leads to our traffic stats at stats.grok.se [1]. However, having worked with those stats and the raw files provided by Domas [2], I am pretty sure that those squid traffic stats are intended to be a complete traffic sample (or nearly so) and not a 1/1000 sample.
We have done various fractionated samples in the past, but I believe the squid logs used for traffic stats at the present time are not fractionated.
If you are talking about a different process of logging not associated with the traffic logs, then I apologize for my confusion.
-Robert Rohde
[1] http://stats.grok.se/ [2] http://dammit.lt/wikistats/
On Mon, Aug 9, 2010 at 10:16 PM, Rob Lanphier robla@wikimedia.org wrote:
Hi everyone,
We're in the process of figuring out how to fix some of the issues in our logging infrastructure. I'm sending this email out both to get the more knowledgeable folks to chime in on where I've got the details wrong, and to get general comment on how we're doing our logging. We may need to recruit contract developers to work on this stuff, so we want to make sure we have clear and accurate information available, and we need to figure out what exactly we want to direct those people to do.
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
At any rate, there are a couple of problems with the way that it works:
1. Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
2. We'd like to increase the granularity of logging so that we can do more sophisticated analysis. For example, if we decide to run a test banner to a limited audience, we need to make sure we're getting more complete logs for that audience, or else we're not getting enough data to do any useful analysis.
If this were your typical commercial operation, the answer would be "why aren't you just logging into Streambase?" (or some other data warehousing storage solution). I'm not suggesting that we do that (or even look at any of the solutions that bill themselves as open source alternatives), since, while our needs are increasing, we still aren't planning to be anywhere near as sophisticated as a lot of data tracking orgs. Still, it's worth asking questions about our existing setup. Should we be looking to optimize our existing single-box setup, extending our software to have multi-node collection, or looking at a whole new collection strategy?
Rob
Robert Rohde wrote:
Rob,
I'm not completely sure whether or not you are talking about the same logging infrastructure that leads to our traffic stats at stats.grok.se [1]. However, having worked with those stats and the raw files provided by Domas [2], I am pretty sure that those squid traffic stats are intended to be a complete traffic sample (or nearly so) and not a 1/1000 sample.
A lot of people seem to be confused about this. After getting contradictory info about this at WikiSym, I asked Domas - and he confirmed that it counts the complete traffic; it's not sampled. But it's only *counting*.
Any extra logs that contain detailed info about individual requests *are* sampled, and are generally temporary.
-- daniel
On 10-08-10 07:16, Rob Lanphier wrote:
At any rate, there are a couple of problems with the way that it works:
- Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
As already stated elsewhere, we didn't really saturate any NICs, just some socket buffers. Because of the large number of configured log pipes, the software (udp2log) could not empty the socket buffers fast enough.
If this were your typical commercial operation, the answer would be "why aren't you just logging into Streambase?" (or some other data warehousing storage solution). I'm not suggesting that we do that (or even look at any of the solutions that bill themselves as open source alternatives), since, while our needs are increasing, we still aren't planning to be anywhere near as sophisticated as a lot of data tracking orgs. Still, it's worth asking questions about our existing setup. Should we be looking to optimize our existing single-box setup, extending our software to have multi-node collection, or looking at a whole new collection strategy?
Besides the ideas that are currently being kicked around of improving or rewriting the udp log collection software, there's also always the short-term, easy option of sending a multicast UDP stream and having multiple collectors with distinct log pipes set up. E.g. one machine for the sampled logging, and another, independent machine to do all the special-purpose log streams. I do like more efficient software solutions rather than throwing more iron at the problem, though. :)
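For reference, the receiving side of such a collector is pretty small. A rough sketch (untested; the multicast group address and port here are made up for illustration, not our actual config):

    // Hypothetical sketch: a second udp2log-style collector joining a
    // multicast log stream. Group address and port are illustrative only.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8420);                          // illustrative port
        bind(fd, (sockaddr*)&addr, sizeof(addr));

        ip_mreq mreq{};
        mreq.imr_multiaddr.s_addr = inet_addr("239.128.0.112");  // illustrative group
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        char buf[65536];
        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), 0);
            if (n > 0)
                fwrite(buf, 1, n, stdout);   // this collector's own log pipes go here
        }
    }

Each collector would then feed its own set of log pipes, so a stall in one special-purpose stream can't hold up the sampled logging on the other box.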
Hi!
multiple collectors with distinct log pipes set up. E.g. one machine for the sampled logging, and another, independent machine to do all the special-purpose log streams. I do like more efficient software solutions rather than throwing more iron at the problem, though. :)
Frankly, we could have the same on a single machine - e.g. two listeners on the same multicast stream - for SMP perf :-)
Domas
Hi Mark,
Thanks for the helpful reply. Comments inline:
On Tue, Aug 10, 2010 at 2:54 AM, Mark Bergsma mark@wikimedia.org wrote:
As already stated elsewhere, we didn't really saturate any NICs, just some socket buffers. Because of the large number of configured log pipes, the software (udp2log) could not empty the socket buffers fast enough.
Based on this and IRC conversations with Tim and Domas, here's my understanding of things now (restating to make sure that I understand):
The current system is a single-threaded application that takes packets in synchronously, and spits them out to several places based on the configuration file described here: http://wikitech.wikimedia.org/view/Squid_logging
One problem that we're hitting is that when this daemon^H^H^H^H^H^Hlistener gets too bogged down with a complex configuration, it doesn't get around to emptying the socket buffer. Since it's single threaded, it's handling each of the configured logging destinations before reading the next packet. We're not CPU-bound at this point. The existing solution seems to start flaking out at 40% CPU with a complicated configuration, and is humming along at 20% with the current simplified config. The problem is that we're blocking while we fire up awk or whatever on the logging side, and overflowing the socket buffer.
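Just to make sure I have the shape of the problem right, here's a rough sketch of what I understand the loop to be doing (an illustration, not the actual udp2log source; it leaves out the per-pipe sampling and filtering):

    // Rough sketch of the current single-threaded mux: read a datagram, then
    // write it to every configured destination before reading the next one.
    #include <sys/socket.h>
    #include <unistd.h>
    #include <vector>

    void mux_loop(int sock, const std::vector<int>& pipe_fds) {
        char buf[65536];
        for (;;) {
            ssize_t n = recv(sock, buf, sizeof(buf), 0);   // blocks
            if (n <= 0) continue;
            for (int fd : pipe_fds)
                write(fd, buf, n);             // also blocks if a pipe is full
        }
    }

If any of those write() calls stalls, nothing drains the socket, and the kernel starts dropping packets once its receive buffer fills.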
A solution that Tim and others are kicking around is reworking the listener in one or more of the following ways:
1. Move to some non-blocking networking library (e.g. Boost asio, libevent)
2. Go multi-threaded
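For the first option, I imagine something in the shape of the following (a minimal libevent sketch, not a drop-in udp2log replacement; the hand-off to the log pipes is left out):

    // Non-blocking sketch: the socket is registered with an event loop and a
    // callback drains it, instead of the loop blocking in recv()/write() in turn.
    #include <event2/event.h>
    #include <sys/socket.h>

    static void on_readable(evutil_socket_t fd, short, void*) {
        char buf[65536];
        ssize_t n;
        while ((n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT)) > 0) {
            // hand the packet off to the log pipes / worker threads here
        }
    }

    void run(int sock) {
        event_base* base = event_base_new();
        event* ev = event_new(base, sock, EV_READ | EV_PERSIST, on_readable, nullptr);
        event_add(ev, nullptr);
        event_base_dispatch(base);
    }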
Mark, as you point out, we could go with some multicast solution if we need to split it up among boxes. As Domas points out, we could even go multi-process on the same box without really maxing it out.
The solutions we're talking about seem to solve the socket buffer problem, but it sounds like we may also need to get some clearer requirements on any new functionality that's needed. It sounds like we'll be able to get some more mileage out of the existing solution with some of the reworking described above. It's not entirely clear yet if this buys us enough capacity+capability for the increased requirements. I'll check in with Tomasz and others working on fundraiser stuff to find out more.
Rob
Rob Lanphier wrote:
Since it's single threaded, it's handling each of the configured logging destinations before reading the next packet. We're not CPU-bound at this point. The existing solution seems to start flaking out at 40% CPU with a complicated configuration, and is humming along at 20% with the current simplified config. The problem is that we're blocking while we fire up awk or whatever on the logging side, and overflowing the socket buffer.
It (the awk process or whatever) is only launched once, then reused. So the problem is that they don't eat the pipe data fast enough, so the pipe accumulates 64 KB and finally blocks.
A solution that Tim and others are kicking around is reworking the listener in one or more of the following ways:
- Move to some non-blocking networking library (e.g. Boost asio, libevent)
- Go multi-threaded
The solutions we're talking about seem to solve the socket buffer problem, but it sounds like we may also need to get some clearer requirements on any new functionality that's needed. It sounds like we'll be able to get some more mileage out of the existing solution with some of the reworking described above. It's not entirely clear yet if this buys us enough capacity+capability for the increased requirements. I'll check in with Tomasz and others working on fundraiser stuff to find out more.
Rob
Going multithreaded is really easy for a socket listener. However, not so much in the LogProcessors. If they are shared across threads, you may end up with all threads blocked in the fwrite, and if they aren't shared, the files may easily corrupt (depends on what exactly you are doing with them).
Since the problem is that the socket buffer fills, it surprises me that the server didn't increase SO_RCVBUF. That's not a solution, but it should help (is it already raised via /proc/sys/net/core/rmem_default?).
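Something like this on the listening socket (a sketch; the 16 MB figure is arbitrary, and the kernel silently caps the value at /proc/sys/net/core/rmem_max unless SO_RCVBUFFORCE is used):

    // Sketch: bump the receive buffer on the listening socket.
    #include <sys/socket.h>

    void grow_rcvbuf(int sock) {
        int bytes = 16 * 1024 * 1024;   // arbitrary; capped by net.core.rmem_max
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
    }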
The real issue is: what are you placing on your pipes that is so slow at reading from them? Optimizing those scripts could be a simpler solution. It wouldn't be hard to make the pipe writes non-blocking, properly blaming the slow pipes that couldn't be written to.
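E.g. something along these lines (a sketch only; the counter and names are illustrative):

    // Sketch of the "blame the slow pipe" idea: put each pipe fd into
    // non-blocking mode and count the events that could not be written,
    // instead of stalling the whole mux.
    #include <cerrno>
    #include <fcntl.h>
    #include <unistd.h>

    void make_nonblocking(int fd) {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    }

    // Returns false if the pipe was full, so the caller can log which
    // destination is falling behind.
    bool try_write(int fd, const char* line, size_t len, long& dropped) {
        ssize_t n = write(fd, line, len);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            ++dropped;      // this pipe's reader is too slow; drop and blame it
            return false;
        }
        return n == (ssize_t)len;
    }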
Hi!
Going multithreaded is really easy for a socket listener.
Really? :)
However, not so much in the LogProcessors. If they are shared across threads, you may end up with all threads blocked in the fwrite, and if they aren't shared, the files may easily corrupt (depends on what exactly you are doing with them).
I don't really understand what you say ;-) Do you mean lost data as 'corrupt'?
Since the problem is that the socket buffer fills, it surprises me that the server didn't increase SO_RCVBUF. That's not a solution, but it should help (is it already raised via /proc/sys/net/core/rmem_default?).
It is a long-term CPU saturation issue - the mux process isn't fast enough to handle 16 output streams. Do note, there are quite a few events a second :)
The real issue is: what are you placing on your pipes that is so slow at reading from them? Optimizing those scripts could be a simpler solution.
No, those scripts are not the bottleneck, there's plenty of CPU available, and they are not blocking (for too long, everything is blocking for a certain amount of time ;-)
It wouldn't be hard to make the pipe writes non-blocking, properly blaming the slow pipes that couldn't be written to.
There are no slow pipes. Bottleneck is udp2log step.
Domas
Domas Mituzas wrote:
Hi!
Going multithreaded is really easy for a socket listener.
Really? :)
Sure. Make each thread call accept and let the kernel give incoming sockets to one of them. There you have the listener done :) Solaris used to need explicit locking, but it is now fixed there, too.
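For a UDP listener like udp2log there is no accept(), but the same trick applies: several threads block in recv() on the same socket and the kernel hands each datagram to exactly one of them. A sketch (thread count picked arbitrarily):

    // Several threads receiving on the same UDP socket; the kernel delivers
    // each datagram to one of them.
    #include <sys/socket.h>
    #include <thread>
    #include <vector>

    void reader(int sock) {
        char buf[65536];
        for (;;) {
            ssize_t n = recv(sock, buf, sizeof(buf), 0);
            if (n > 0) {
                // split into lines and hand off to the log processors
            }
        }
    }

    void start_readers(int sock, int nthreads = 4) {
        std::vector<std::thread> threads;
        for (int i = 0; i < nthreads; ++i)
            threads.emplace_back(reader, sock);
        for (auto& t : threads) t.join();
    }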
However, not so much in the LogProcessors. If they are shared across threads, you may end up with all threads blocked in the fwrite, and if they aren't shared, the files may easily corrupt (depends on what exactly you are doing with them).
I don't really understand what you say ;-) Do you mean lost data as 'corrupt'?
Given the following incoming events:
udp2log has problems
jeluf created a new wiki
domas fixed the server
I call this corrupted:
jeluf domas udp2log has fixed the server problems created a new wiki
Since the problem is that the socket buffer fills, it surprises me that the server didn't increase SO_RCVBUF. That's not a solution, but it should help (is it already raised via /proc/sys/net/core/rmem_default?).
It is a long-term CPU saturation issue - the mux process isn't fast enough to handle 16 output streams. Do note, there are quite a few events a second :)
The real issue is: what are you placing on your pipes that is so slow at reading from them? Optimizing those scripts could be a simpler solution.
No, those scripts are not the bottleneck, there's plenty of CPU available, and they are not blocking (for too long, everything is blocking for a certain amount of time ;-)
It wouldn't be hard to make the pipe writes non-blocking, properly blaming the slow pipes that couldn't be written to.
There are no slow pipes. Bottleneck is udp2log step.
Domas
I don't get it. What is slow about it?
What it does is:
1) Get socket data
2) Split line into pieces
3) fwrite each line in 16 fds
4) Go to 1
If there's plenty of CPU, the pipes don't fill and the fwrite doesn't block... Why isn't it coping with it? Too much time lost in context switches?
Hi!
Sure. Make each thread call accept and let the kernel give incoming sockets to one of them. There you have the listener done :) Solaris used to need an explicit locking, but it is now fixed there, too.
Heh, I had somewhat ignored this approach - yeah, it would work just fine - one can do per-file synchronization rather than per-event, as there's not much state involved on either side.
Given the following incoming events:
udp2log has problems
jeluf created a new wiki
domas fixed the server
I call this corrupted:
jeluf domas udp2log has fixed the server problems created a new wiki
Well, you wouldn't want to use fwrite/etc. calls, as their behavior in a threaded environment isn't that useful :) write()s aren't atomic either, so... what you have to do is:
lock(fd);
write();
write();
write(); (may be needed for a single buffer, in case the first write is not complete)
unlock(fd);
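In C++-ish terms, that per-file synchronization could look roughly like this (a sketch, not what udp2log does today; error handling omitted):

    // Per-fd synchronisation sketch: one mutex per output pipe, held across the
    // write()s for a single event so lines from different threads don't interleave.
    #include <mutex>
    #include <string>
    #include <unistd.h>

    struct LogPipe {
        int fd;
        std::mutex lock;

        void write_event(const std::string& line) {
            std::lock_guard<std::mutex> guard(lock);
            size_t off = 0;
            while (off < line.size()) {                 // retry on short writes
                ssize_t n = write(fd, line.data() + off, line.size() - off);
                if (n <= 0) break;                      // real error handling omitted
                off += n;
            }
        }
    };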
I don't get it. What is slow about it?
What it does is:
- Get socket data
- Split line into pieces
- fwrite each line in 16 fds
- Go to 1
1) Get socket data
2) Split packet into lines
3) Write lines into 16 fds
4) Go to 1
If there's plenty of CPU, the pipes don't fill and the fwrite doesn't block... Why isn't it coping with it? Too much time lost in context switches?
There are no context switches, as it's running fully on a core. "Plenty of CPU" is 100% core use; most of the time is spent in write(), and apparently syscalls aren't free.
Domas
On Thu, Aug 12, 2010 at 10:54 AM, Domas Mituzas midom.lists@gmail.com wrote:
There are no context switches, as it's running fully on a core. "Plenty of CPU" is 100% core use; most of the time is spent in write(), and apparently syscalls aren't free.
Without having looked at any code, can't the threads just add data to a semaphore-protected linked list (fast), and have a single separate thread write the stuff to disk occasionally?
Magnus
Without having looked at any code, can't the threads just add data to a semaphore-protected linked list (fast), and have a single separate thread write the stuff to disk occasionally?
Isn't that the usual error that threaded software developers make:
1. make all threads depend on a single mutex
2. watch them fight! (you'd get a million wakeups a second here :-)
as a bonus point you get the need to copy data to a separate buffer, or a frenzy of memory allocation with another mutex for malloc/free ;-)
Domas
On Fri, Aug 13, 2010 at 6:16 AM, Domas Mituzas midom.lists@gmail.com wrote:
Without having looked at any code, can't the threads just add data to a semaphore-protected linked list (fast), and have a single separate thread write the stuff to disk occasionally?
Isn't that the usual error that threaded software developers make:
- make all threads depend on a single mutex
- watch them fight! (you'd get a million wakeups a second here :-)
But adding to a linked list (essentially, changing a pointer) is about the cheapest operation there is...
as a bonus point you get the need to copy data to a separate buffer, or a frenzy of memory allocation with another mutex for malloc/free ;-)
Disk dump thread:
* Get mutex for list start pointer
* Copy list start pointer
* Reset list start pointer = NULL
* Release mutex
* Write list to disk
* Release memory
If you allocate memory per list item, the freed ones should nicely fit the next ones, so malloc would not be too slow, I imagine (just use char[] of fixed size).
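In code, the idea would be roughly the following (an untested sketch; a vector stands in for the linked list, and the names are made up):

    // Sketch of the list-swap idea: producers append under a mutex, the dump
    // thread swaps the whole batch out under the same mutex and does the slow
    // writing with the lock released.
    #include <mutex>
    #include <string>
    #include <vector>
    #include <cstdio>

    static std::mutex list_lock;
    static std::vector<std::string> pending;   // stand-in for the linked list

    void append_event(std::string line) {      // called by the receiving threads
        std::lock_guard<std::mutex> g(list_lock);
        pending.push_back(std::move(line));
    }

    void dump_to_disk(std::FILE* out) {        // the disk dump thread
        std::vector<std::string> batch;
        {
            std::lock_guard<std::mutex> g(list_lock);
            batch.swap(pending);               // "copy and reset the start pointer"
        }
        for (const auto& line : batch)         // slow I/O happens outside the lock
            std::fwrite(line.data(), 1, line.size(), out);
    }

The receiving threads only hold the mutex long enough to append, and the dump thread only holds it long enough to swap the list out, so the slow disk I/O never happens under the lock.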
Just curious: is the "million wakeups" an actual number, or a figure of speech? How many views/sec are there?
Cheers, Magnus
On Fri, Aug 13, 2010 at 8:55 AM, Magnus Manske magnusmanske@googlemail.com wrote:
Disk dump thread:
- Get mutex for list start pointer
- Copy list start pointer
- Reset list start pointer = NULL
- Release mutex
- Write list to disk
- Release memory
If you allocate memory per list item, the freed ones should nicely fit the next ones, so malloc would not be too slow, I imagine (just use char[] of fixed size).
While we're having fun speculating on possible designs without actually volunteering to write the code ;): wouldn't a circular buffer make more sense than a linked list? You could have one buffer per thread, to avoid locking for the end pointer, and then the thread that writes the logs to disk can just go around all the buffers in turn, grabbing the requests from the start, ordering them by timestamp, and writing them out in the correct order. I don't think this would need any locking at all, and if you make the buffer large enough that you don't have to worry about the dump thread not being able to keep up, you wouldn't even have any cache bouncing.
You could also avoid malloc()/free() issues by just preallocating all the buffers, with only the dump thread doing malloc()/free() during operation. If you keep enough buffer space for 100k requests total, and each request takes 1 KB, that's only 100 MB.
Now I want Domas to shoot holes in my idea too! :)
Just curious: is the "million wakeups" an actual number, or a figure of speech? How many views/sec are there?
Current #wikimedia-tech topic says that peak load is about 100k req/s. Assuming a million per second is a reasonable idea for future-proofing.
On 14/08/10 03:56, Aryeh Gregor wrote:
While we're having fun speculating on possible designs without actually volunteering to write the code ;): wouldn't a circular buffer make more sense than a linked list?
[...]
Now I want Domas to shoot holes in my idea too! :)
He might have shot holes in it if he hadn't suggested it 5 days ago:
<domas> my armchair architect idea was writing to some cyclical shared buffer
<domas> and allowing other threads to pick stuff from it
The tricky thing would be efficiently synchronising updates to the queue, so that reads and writes are atomic. You wouldn't need to do many syscalls in the main thread to make it slower than the current version.
-- Tim Starling
On Sun, Aug 15, 2010 at 12:07 AM, Tim Starling tstarling@wikimedia.org wrote:
He might have shot holes in it if he hadn't suggested it 5 days ago:
<domas> my armchair architect idea was writing to some cyclical shared buffer
<domas> and allowing other threads to pick stuff from it
The tricky thing would be efficiently synchronising updates to the queue, so that reads and writes are atomic. You wouldn't need to do many syscalls in the main thread to make it slower than the current version.
Have one circular buffer per write thread. The pointer to the front of each buffer would only be written by the thread that writes to that buffer, after it's written the new entry to the buffer. The pointer to the back would only be written by the one thread that reads all the buffers. Since each pointer is a single word long and is only written to by a single thread, no locks are needed anywhere, and no more syscalls than at present.
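Roughly like this, per receiving thread (an untested sketch; the atomics are just there to make the single-writer-per-pointer assumption explicit to the compiler and CPU):

    // Single-producer / single-consumer ring buffer, one per receiving thread.
    // Only the producer advances 'head', only the dump thread advances 'tail',
    // so no locks are needed.
    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>
    #include <string>

    template <size_t N>                         // N must be a power of two
    class SpscRing {
        std::array<std::string, N> slots;
        std::atomic<size_t> head{0};            // written by the producer only
        std::atomic<size_t> tail{0};            // written by the consumer only
    public:
        bool push(std::string line) {           // receiving thread
            size_t h = head.load(std::memory_order_relaxed);
            if (h - tail.load(std::memory_order_acquire) == N)
                return false;                   // buffer full: drop or back off
            slots[h & (N - 1)] = std::move(line);
            head.store(h + 1, std::memory_order_release);
            return true;
        }
        std::optional<std::string> pop() {      // dump thread
            size_t t = tail.load(std::memory_order_relaxed);
            if (t == head.load(std::memory_order_acquire))
                return std::nullopt;            // empty
            std::string line = std::move(slots[t & (N - 1)]);
            tail.store(t + 1, std::memory_order_release);
            return line;
        }
    };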