Hi everyone,
We're in the process of figuring out how to fix some of the issues in our logging infrastructure. I'm sending this email out both to get the more knowledgeable folks to chime in on where I've got the details wrong, and to get general comment on how we're doing our logging. We may need to recruit contract developers to work on this stuff, so we want to make sure we have clear and accurate information available, and we need to figure out what exactly we want to direct those people to do.
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
At any rate, there are a couple of problems with the way that it works:
1. Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
2. We'd like to increase the granularity of logging so that we can do more sophisticated analysis. For example, if we decide to run a test banner to a limited audience, we need to make sure we're getting more complete logs for that audience, or else we're not getting enough data to do any useful analysis.
If this were your typical commercial operation, the answer would be "why aren't you just logging into Streambase?" (or some other data warehousing storage solution). I'm not suggesting that we do that (or even look at any of the solutions that bill themselves as open source alternatives), since, while our needs are increasing, we still aren't planning to be anywhere near as sophisticated as a lot of data tracking orgs. Still, it's worth asking questions about our existing setup. Should we be looking to optimize our existing single-box setup, extending our software to have multi-node collection, or looking at a whole new collection strategy?
Rob
On 10/08/10 15:16, Rob Lanphier wrote:
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
Maybe you should find out who wrote the relevant code and set up the relevant infrastructure, and ask them directly. It's not difficult to find out who it was.
At any rate, there are a couple of problems with the way that it works:
- Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
We haven't saturated any NICs.
-- Tim Starling
On Mon, Aug 9, 2010 at 11:17 PM, Tim Starling tstarling@wikimedia.org wrote:
On 10/08/10 15:16, Rob Lanphier wrote:
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
Maybe you should find out who wrote the relevant code and set up the relevant infrastructure, and ask them directly. It's not difficult to find out who it was.
Well, yes, I was hoping you'd weigh in on this thread.
At any rate, there are a couple of problems with the way that it works:
- Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
We haven't saturated any NICs.
Sorry, I assumed it was a NIC. There has been packet loss, from what I understand. I'll leave it at that.
Rob
Rob,
I'm not completely sure whether or not you are talking about the same logging infrastructure that leads to our traffic stats at stats.grok.se [1]. However, having worked with those stats and the raw files provided by Domas [2], I am pretty sure that those squid traffic stats are intended to be a complete traffic sample (or nearly so) and not a 1/1000 sample.
We have done various fractionated samples in the past, but I believe the squid logs used for traffic stats at the present time are not fractionated.
If you are talking about a different process of logging not associated with the traffic logs, then I apologize for my confusion.
-Robert Rohde
[1] http://stats.grok.se/ [2] http://dammit.lt/wikistats/
On Mon, Aug 9, 2010 at 10:16 PM, Rob Lanphier robla@wikimedia.org wrote:
Hi everyone,
We're in the process of figuring out how to fix some of the issues in our logging infrastructure. I'm sending this email out both to get the more knowledgeable folks to chime in on where I've got the details wrong, and to get general comment on how we're doing our logging. We may need to recruit contract developers to work on this stuff, so we want to make sure we have clear and accurate information available, and we need to figure out what exactly we want to direct those people to do.
We have a single collection point for all of our logging, which is actually just a sampling of the overall traffic (designed to be roughly one out of every 1000 hits). The process is described here: http://wikitech.wikimedia.org/view/Squid_logging
My understanding is that this code is also involved somewhere: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ ...but I'm a little unclear on what the relationship is between that code and the code in trunk/udplog.
At any rate, there are a couple of problems with the way that it works:
1. Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
2. We'd like to increase the granularity of logging so that we can do more sophisticated analysis. For example, if we decide to run a test banner to a limited audience, we need to make sure we're getting more complete logs for that audience, or else we're not getting enough data to do any useful analysis.
If this were your typical commercial operation, the answer would be "why aren't you just logging into Streambase?" (or some other data warehousing storage solution). I'm not suggesting that we do that (or even look at any of the solutions that bill themselves as open source alternatives), since, while our needs are increasing, we still aren't planning to be anywhere near as sophisticated as a lot of data tracking orgs. Still, it's worth asking questions about our existing setup. Should we be looking to optimize our existing single-box setup, extending our software to have multi-node collection, or looking at a whole new collection strategy?
Rob
Robert Rohde wrote:
Rob,
I'm not completely sure whether or not you are talking about the same logging infrastructure that leads to our traffic stats at stats.grok.se [1]. However, having worked with those stats and the raw files provided by Domas [2], I am pretty sure that those squid traffic stats are intended to be a complete traffic sample (or nearly so) and not a 1/1000 sample.
A lot of people seem to be confused about this. After getting contradictory info about this at WikiSym, I asked Domas - and he confirmed that it counts the complete traffic; it's not sampled. But it's only *counting*.
Any extra logs that contain detailed info about individual requests *are* sampled, and are generally temporary.
-- daniel
On 10-08-10 07:16, Rob Lanphier wrote:
At any rate, there are a couple of problems with the way that it works:
- Once we saturate the NIC on the logging machine, the quality of our sampling degrades pretty rapidly. We've generally had a problem with that over the past few months.
As already stated elsewhere, we didn't really saturate any NICs, just some socket buffers. Because of the large number of configured log pipes, the software (udp2log) could not empty the socket buffers fast enough.
If this were your typical commercial operation, the answer would be "why aren't you just logging into Streambase?" (or some other data warehousing storage solution). I'm not suggesting that we do that (or even look at any of the solutions that bill themselves as open source alternatives), since, while our needs are increasing, we still aren't planning to be anywhere near as sophisticated as a lot of data tracking orgs. Still, it's worth asking questions about our existing setup. Should we be looking to optimize our existing single-box setup, extending our software to have multi-node collection, or looking at a whole new collection strategy?
Besides the ideas that are currently being kicked around of improving or rewriting the udp log collection software, there's also always the short-term, easy option of sending a multicast UDP stream and having multiple collectors with distinct log pipes set up. E.g. one machine for the sampled logging, and another, independent machine to do all the special-purpose log streams. I do like more efficient software solutions rather than throwing more iron at the problem, though. :)
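For reference, the receiving side of such a collector is pretty small. A rough sketch (untested; the multicast group address and port here are made up for illustration, not our actual config):

    // Hypothetical sketch: a second udp2log-style collector joining a
    // multicast log stream. Group address and port are illustrative only.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8420);                          // illustrative port
        bind(fd, (sockaddr*)&addr, sizeof(addr));

        ip_mreq mreq{};
        mreq.imr_multiaddr.s_addr = inet_addr("239.128.0.112");  // illustrative group
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        char buf[65536];
        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), 0);
            if (n > 0)
                fwrite(buf, 1, n, stdout);   // this collector's own log pipes go here
        }
    }

Each collector would then feed its own set of log pipes, so a stall in one special-purpose stream can't hold up the sampled logging on the other box.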
Hi!
multiple collectors with distinct log pipes set up. E.g. one machine for the sampled logging, and another, independent machine to do all the special-purpose log streams. I do like more efficient software solutions rather than throwing more iron at the problem, though. :)
Frankly, we could have the same on a single machine - e.g. two listeners on the same multicast stream - for SMP perf :-)
Domas
Hi Mark,
Thanks for the helpful reply. Comments inline:
On Tue, Aug 10, 2010 at 2:54 AM, Mark Bergsma mark@wikimedia.org wrote:
As already stated elsewhere, we didn't really saturate any NICs, just some socket buffers. Because of the large number of configured log pipes, the software (udp2log) could not empty the socket buffers fast enough.
Based on this and IRC conversations with Tim and Domas, here's my understanding of things now (restating to make sure that I understand):
The current system is a single-threaded application that takes packets in synchronously, and spits them out to several places based on the configuration file described here: http://wikitech.wikimedia.org/view/Squid_logging
One problem that we're hitting is that when this daemon^H^H^H^H^H^Hlistener gets too bogged down with a complex configuration, it doesn't get around to emptying the socket buffer. Since it's single threaded, it's handling each of the configured logging destinations before reading the next packet. We're not CPU-bound at this point. The existing solution seems to start flaking out at 40% CPU with a complicated configuration, and is humming along at 20% with the current simplified config. The problem is that we're blocking while we fire up awk or whatever on the logging side, and overflowing the socket buffer.
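Just to make sure I have the shape of the problem right, here's a rough sketch of what I understand the loop to be doing (an illustration, not the actual udp2log source; it leaves out the per-pipe sampling and filtering):

    // Rough sketch of the current single-threaded mux: read a datagram, then
    // write it to every configured destination before reading the next one.
    #include <sys/socket.h>
    #include <unistd.h>
    #include <vector>

    void mux_loop(int sock, const std::vector<int>& pipe_fds) {
        char buf[65536];
        for (;;) {
            ssize_t n = recv(sock, buf, sizeof(buf), 0);   // blocks
            if (n <= 0) continue;
            for (int fd : pipe_fds)
                write(fd, buf, n);             // also blocks if a pipe is full
        }
    }

If any of those write() calls stalls, nothing drains the socket, and the kernel starts dropping packets once its receive buffer fills.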
A solution that Tim and others are kicking around is reworking the listener in one or more of the following ways:
1. Move to some non-blocking networking library (e.g. Boost asio, libevent)
2. Go multi-threaded
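For the first option, I imagine something in the shape of the following (a minimal libevent sketch, not a drop-in udp2log replacement; the hand-off to the log pipes is left out):

    // Non-blocking sketch: the socket is registered with an event loop and a
    // callback drains it, instead of the loop blocking in recv()/write() in turn.
    #include <event2/event.h>
    #include <sys/socket.h>

    static void on_readable(evutil_socket_t fd, short, void*) {
        char buf[65536];
        ssize_t n;
        while ((n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT)) > 0) {
            // hand the packet off to the log pipes / worker threads here
        }
    }

    void run(int sock) {
        event_base* base = event_base_new();
        event* ev = event_new(base, sock, EV_READ | EV_PERSIST, on_readable, nullptr);
        event_add(ev, nullptr);
        event_base_dispatch(base);
    }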
Mark, as you point out, we could go with some multicast solution if we need to split it up among boxes. As Domas points out, we could even go multi-process on the same box without really maxing it out.
The solutions we're talking about seem to solve the socket buffer problem, but it sounds like we may also need to get some clearer requirements on any new functionality that's needed. It sounds like we'll be able to get some more mileage out of the existing solution with some of the reworking described above. It's not entirely clear yet if this buys us enough capacity+capability for the increased requirements. I'll check in with Tomasz and others working on fundraiser stuff to find out more.
Rob
Rob Lanphier wrote:
Since it's single threaded, it's handling each of the configured logging destinations before reading the next packet. We're not CPU-bound at this point. The existing solution seems to start flaking out at 40% CPU with a complicated configuration, and is humming along at 20% with the current simplified config. The problem is that we're blocking while we fire up awk or whatever on the logging side, and overflowing the socket buffer.
It (the awk process or whatever) is only launched once, then reused. So the problem is that they don't eat the pipe data fast enough, so the pipe accumulates 64 KB and finally blocks.
A solution that Tim and others are kicking around is reworking the listener in one or more of the following ways:
- Move to some non-blocking networking library (e.g. Boost asio, libevent)
- Go multi-threaded
The solutions we're talking about seem to solve the socket buffer problem, but it sounds like we may also need to get some clearer requirements on any new functionality that's needed. It sounds like we'll be able to get some more mileage out of the existing solution with some of the reworking described above. It's not entirely clear yet if this buys us enough capacity+capability for the increased requirements. I'll check in with Tomasz and others working on fundraiser stuff to find out more.
Rob
Going multithreaded is really easy for a socket listener. However, not so much in the LogProcessors. If they are shared across threads, you may end up with all threads blocked in the fwrite, and if they aren't shared, the files may easily corrupt (depends on what exactly you are doing with them).
Since the problem is that the socket buffer fills, it surprises me that the server didn't increase SO_RCVBUF. That's not a solution, but it should help (is it already raised via /proc/sys/net/core/rmem_default?).
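Something like this on the listening socket (a sketch; the 16 MB figure is arbitrary, and the kernel silently caps the value at /proc/sys/net/core/rmem_max unless SO_RCVBUFFORCE is used):

    // Sketch: bump the receive buffer on the listening socket.
    #include <sys/socket.h>

    void grow_rcvbuf(int sock) {
        int bytes = 16 * 1024 * 1024;   // arbitrary; capped by net.core.rmem_max
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
    }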
The real issue is: what are you placing on your pipes that is so slow at reading from them? Optimizing those scripts could be a simpler solution. It wouldn't be hard to make the pipe writes non-blocking, properly blaming the slow pipes that couldn't be written to.
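E.g. something along these lines (a sketch only; the counter and names are illustrative):

    // Sketch of the "blame the slow pipe" idea: put each pipe fd into
    // non-blocking mode and count the events that could not be written,
    // instead of stalling the whole mux.
    #include <cerrno>
    #include <fcntl.h>
    #include <unistd.h>

    void make_nonblocking(int fd) {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    }

    // Returns false if the pipe was full, so the caller can log which
    // destination is falling behind.
    bool try_write(int fd, const char* line, size_t len, long& dropped) {
        ssize_t n = write(fd, line, len);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            ++dropped;      // this pipe's reader is too slow; drop and blame it
            return false;
        }
        return n == (ssize_t)len;
    }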
Hi!
Going multithreaded is really easy for a socket listener.
Really? :)
However, not so much in the LogProcessors. If they are shared across threads, you may end up with all threads blocked in the fwrite, and if they aren't shared, the files may easily corrupt (depends on what exactly you are doing with them).
I don't really understand what you say ;-) Do you mean lost data as 'corrupt'?
Since the problem is that the socket buffer fills, it surprises me that the server didn't increase SO_RCVBUF. That's not a solution, but it should help (is it already raised via /proc/sys/net/core/rmem_default?).
It is a long-term CPU saturation issue - the mux process isn't fast enough to handle 16 output streams. Do note, there are quite a few events a second :)
The real issue is: what are you placing on your pipes that is so slow at reading from them? Optimizing those scripts could be a simpler solution.
No, those scripts are not the bottleneck, there's plenty of CPU available, and they are not blocking (for too long, everything is blocking for a certain amount of time ;-)
It wouldn't be hard to make the pipe writes non-blocking, properly blaming the slow pipes that couldn't be written to.
There are no slow pipes. Bottleneck is udp2log step.
Domas
Domas Mituzas wrote:
Hi!
Going multithreaded is really easy for a socket listener.
Really? :)
Sure. Make each thread call accept and let the kernel give incoming sockets to one of them. There you have the listener done :) Solaris used to need explicit locking, but it is now fixed there, too.
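For a UDP listener like udp2log there is no accept(), but the same trick applies: several threads block in recv() on the same socket and the kernel hands each datagram to exactly one of them. A sketch (thread count picked arbitrarily):

    // Several threads receiving on the same UDP socket; the kernel delivers
    // each datagram to one of them.
    #include <sys/socket.h>
    #include <thread>
    #include <vector>

    void reader(int sock) {
        char buf[65536];
        for (;;) {
            ssize_t n = recv(sock, buf, sizeof(buf), 0);
            if (n > 0) {
                // split into lines and hand off to the log processors
            }
        }
    }

    void start_readers(int sock, int nthreads = 4) {
        std::vector<std::thread> threads;
        for (int i = 0; i < nthreads; ++i)
            threads.emplace_back(reader, sock);
        for (auto& t : threads) t.join();
    }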
However, not so much in the LogProcessors. If they are shared across threads, you may end up with all threads blocked in the fwrite, and if they aren't shared, the files may easily corrupt (depends on what exactly you are doing with them).
I don't really understand what you say ;-) Do you mean lost data as 'corrupt'?
Given the following incoming events:
udp2log has problems
jeluf created a new wiki
domas fixed the server
I call this corrupted:
jeluf domas udp2log has fixed the server problems created a new wiki
Since the problem is that the socket buffer fills, it surprises me that the server didn't increase SO_RCVBUF. That's not a solution, but it should help (is it already raised via /proc/sys/net/core/rmem_default?).
It is a long-term CPU saturation issue - the mux process isn't fast enough to handle 16 output streams. Do note, there are quite a few events a second :)
The real issue is: what are you placing on your pipes that is so slow at reading from them? Optimizing those scripts could be a simpler solution.
No, those scripts are not the bottleneck, there's plenty of CPU available, and they are not blocking (for too long, everything is blocking for a certain amount of time ;-)
It wouldn't be hard to make the pipe writes non-blocking, properly blaming the slow pipes that couldn't be written to.
There are no slow pipes. Bottleneck is udp2log step.
Domas
I don't get it. What is slow about it?
What it does is:
1) Get socket data
2) Split line into pieces
3) fwrite each line in 16 fds
4) Go to 1
If there's plenty of CPU, the pipes don't fill and the fwrite doesn't block... Why isn't it coping with it? Too much time lost in context switches?
Hi!
Sure. Make each thread call accept and let the kernel give incoming sockets to one of them. There you have the listener done :) Solaris used to need an explicit locking, but it is now fixed there, too.
Heh, I had somewhat ignored this approach - yeah, it would work just fine - one can do per-file synchronization rather than per-event, as there's not much state involved on either side.
Given the following incoming events:
udp2log has problems
jeluf created a new wiki
domas fixed the server
I call this corrupted:
jeluf domas udp2log has fixed the server problems created a new wiki
Well, you wouldn't want to use fwrite/etc. calls, as their behavior in a threaded environment isn't that useful :) write()s aren't atomic either, so... what you have to do is:
lock(fd);
write();
write();
write(); (may be needed for a single buffer, in case the first write is not complete)
unlock(fd);
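In C++-ish terms, that per-file synchronization could look roughly like this (a sketch, not what udp2log does today; error handling omitted):

    // Per-fd synchronisation sketch: one mutex per output pipe, held across the
    // write()s for a single event so lines from different threads don't interleave.
    #include <mutex>
    #include <string>
    #include <unistd.h>

    struct LogPipe {
        int fd;
        std::mutex lock;

        void write_event(const std::string& line) {
            std::lock_guard<std::mutex> guard(lock);
            size_t off = 0;
            while (off < line.size()) {                 // retry on short writes
                ssize_t n = write(fd, line.data() + off, line.size() - off);
                if (n <= 0) break;                      // real error handling omitted
                off += n;
            }
        }
    };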
I don't get it. What is slow about it?
What it does is:
- Get socket data
- Split line into pieces
- fwrite each line in 16 fds
- Go to 1
1) Get socket data
2) Split packet into lines
3) Write lines into 16 fds
4) Go to 1
If there's plenty of CPU, the pipes don't fill and the fwrite doesn't block... Why isn't it coping with it? Too much time lost in context switches?
There are no context switches, as it's running fully on a core. "Plenty of CPU" is 100% core use; most of the time is spent in write(), and apparently syscalls aren't free.
Domas
On Thu, Aug 12, 2010 at 10:54 AM, Domas Mituzas midom.lists@gmail.com wrote:
There are no context switches, as it's running fully on a core. "Plenty of CPU" is 100% core use; most of the time is spent in write(), and apparently syscalls aren't free.
Without having looked at any code, can't the threads just add data to a semaphore-protected linked list (fast), and have a single separate thread write the stuff to disk occasionally?
Magnus
Without having looked at any code, can't the threads just add data to a semaphore-protected linked list (fast), and have a single separate thread write the stuff to disk occasionally?
Isn't that the usual error that threaded software developers make:
1. make all threads depend on a single mutex
2. watch them fight! (you'd get a million wakeups a second here :-)
as a bonus point you get the need to copy data to a separate buffer, or a frenzy of memory allocation with another mutex for malloc/free ;-)
Domas
On Fri, Aug 13, 2010 at 6:16 AM, Domas Mituzas midom.lists@gmail.com wrote:
Without having looked at any code, can't the threads just add data to a semaphore-protected linked list (fast), and have a single separate thread write the stuff to disk occasionally?
Isn't that the usual error that threaded software developers make:
- make all threads depend on a single mutex
- watch them fight! (you'd get a million wakeups a second here :-)
But adding to a linked list (essentially, changing a pointer) is about the cheapest operation there is...
as a bonus point you get the need to copy data to a separate buffer, or a frenzy of memory allocation with another mutex for malloc/free ;-)
Disk dump thread:
* Get mutex for list start pointer
* Copy list start pointer
* Reset list start pointer = NULL
* Release mutex
* Write list to disk
* Release memory
If you allocate memory per list item, the freed ones should nicely fit the next ones, so malloc would not be too slow, I imagine (just use char[] of fixed size).
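In code, the idea would be roughly the following (an untested sketch; a vector stands in for the linked list, and the names are made up):

    // Sketch of the list-swap idea: producers append under a mutex, the dump
    // thread swaps the whole batch out under the same mutex and does the slow
    // writing with the lock released.
    #include <mutex>
    #include <string>
    #include <vector>
    #include <cstdio>

    static std::mutex list_lock;
    static std::vector<std::string> pending;   // stand-in for the linked list

    void append_event(std::string line) {      // called by the receiving threads
        std::lock_guard<std::mutex> g(list_lock);
        pending.push_back(std::move(line));
    }

    void dump_to_disk(std::FILE* out) {        // the disk dump thread
        std::vector<std::string> batch;
        {
            std::lock_guard<std::mutex> g(list_lock);
            batch.swap(pending);               // "copy and reset the start pointer"
        }
        for (const auto& line : batch)         // slow I/O happens outside the lock
            std::fwrite(line.data(), 1, line.size(), out);
    }

The receiving threads only hold the mutex long enough to append, and the dump thread only holds it long enough to swap the list out, so the slow disk I/O never happens under the lock.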
Just curious: is the "million wakeups" an actual number, or a figure of speech? How many views/sec are there?
Cheers, Magnus
On Fri, Aug 13, 2010 at 8:55 AM, Magnus Manske magnusmanske@googlemail.com wrote:
Disk dump thread:
- Get mutex for list start pointer
- Copy list start pointer
- Reset list start pointer = NULL
- Release mutex
- Write list to disk
- Release memory
If you allocate memory per list item, the freed ones should nicely fit the next ones, so malloc would not be too slow, I imagine (just use char[] of fixed size).
While we're having fun speculating on possible designs without actually volunteering to write the code ;): wouldn't a circular buffer make more sense than a linked list? You could have one buffer per thread, to avoid locking for the end pointer, and then the thread that writes the logs to disk can just go around all the buffers in turn, grabbing the requests from the start, ordering them by timestamp, and writing them out in the correct order. I don't think this would need any locking at all, and if you make the buffer large enough that you don't have to worry about the dump thread not being able to keep up, you wouldn't even have any cache bouncing.
You could also avoid malloc()/free() issues by just preallocating all the buffers, with only the dump thread doing malloc()/free() during operation. If you keep enough buffer space for 100k requests total, and each request takes 1 KB, that's only 100 MB.
Now I want Domas to shoot holes in my idea too! :)
Just curious: is the "million wakeups" an actual number, or a figure of speech? How many views/sec are there?
Current #wikimedia-tech topic says that peak load is about 100k req/s. Assuming a million per second is a reasonable idea for future-proofing.
On 14/08/10 03:56, Aryeh Gregor wrote:
While we're having fun speculating on possible designs without actually volunteering to write the code ;): wouldn't a circular buffer make more sense than a linked list?
[...]
Now I want Domas to shoot holes in my idea too! :)
He might have shot holes in it if he hadn't suggested it 5 days ago:
<domas> my armchair architect idea was writing to some cyclical shared buffer
<domas> and allowing other threads to pick stuff from it
The tricky thing would be efficiently synchronising updates to the queue, so that reads and writes are atomic. You wouldn't need to do many syscalls in the main thread to make it slower than the current version.
-- Tim Starling
On Sun, Aug 15, 2010 at 12:07 AM, Tim Starling tstarling@wikimedia.org wrote:
He might have shot holes in it if he hadn't suggested it 5 days ago:
<domas> my armchair architect idea was writing to some cyclical shared buffer
<domas> and allowing other threads to pick stuff from it
The tricky thing would be efficiently synchronising updates to the queue, so that reads and writes are atomic. You wouldn't need to do many syscalls in the main thread to make it slower than the current version.
Have one circular buffer per write thread. The pointer to the front of each buffer would only be written by the thread that writes to that buffer, after it's written the new entry to the buffer. The pointer to the back would only be written by the one thread that reads all the buffers. Since each pointer is a single word long and is only written to by a single thread, no locks are needed anywhere, and no more syscalls than at present.
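Roughly like this, per receiving thread (an untested sketch; the atomics are just there to make the single-writer-per-pointer assumption explicit to the compiler and CPU):

    // Single-producer / single-consumer ring buffer, one per receiving thread.
    // Only the producer advances 'head', only the dump thread advances 'tail',
    // so no locks are needed.
    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>
    #include <string>

    template <size_t N>                         // N must be a power of two
    class SpscRing {
        std::array<std::string, N> slots;
        std::atomic<size_t> head{0};            // written by the producer only
        std::atomic<size_t> tail{0};            // written by the consumer only
    public:
        bool push(std::string line) {           // receiving thread
            size_t h = head.load(std::memory_order_relaxed);
            if (h - tail.load(std::memory_order_acquire) == N)
                return false;                   // buffer full: drop or back off
            slots[h & (N - 1)] = std::move(line);
            head.store(h + 1, std::memory_order_release);
            return true;
        }
        std::optional<std::string> pop() {      // dump thread
            size_t t = tail.load(std::memory_order_relaxed);
            if (t == head.load(std::memory_order_acquire))
                return std::nullopt;            // empty
            std::string line = std::move(slots[t & (N - 1)]);
            tail.store(t + 1, std::memory_order_release);
            return line;
        }
    };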