I'm working on testing Kafka broker failover, to see whether (and how many) messages are lost when a broker dies while producers are sending messages. Here's what I'm doing to test. It's nothing rigorous, just a try-it-and-see-what-happens. Oh, and here's a gist with the commands and scripts I'm using: https://gist.github.com/3286692
1) Start up a broker on each of an03 and an04: $ bin/kafka-server-start.sh config/server.properties
2) Start producing on an03: $ ./sequence_generate.sh 20000 10000 | bin/kafka-console-producer.sh --zookeeper analytics1003:2181 --topic test6
3) While logs are being produced, kill a Kafka broker.
4) Produce some more, and at some point while still producing, re-start the downed broker: $ bin/kafka-server-start.sh config/server.properties
5) Finish producing. Kill the producer to make sure it doesn't have anything batched. (This should only matter with an asynchronous producer.)
6) Start a consumer on an04, saving stdout: $ bin/kafka-consumer-shell.sh --props config/consumer.properties --topic test6 | tee -a /tmp/test6.log
7) Check the logs to make sure all messages made it through: $ egrep '^consumed:' /tmp/test6.log | awk '{print $3}' | sort -g | ~/bin/sequence_check.sh 1 20000
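The real scripts are in the gist; as a rough sketch of what sequence_check.sh does (my reconstruction, not the gist's actual code), it reads sorted integers on stdin and reports any gaps in the expected range:

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of sequence_check.sh (the real script is in
# the gist): read sorted integers on stdin and print any values missing
# from the range [start, end].
sequence_check() {
  local start=$1 end=$2 n
  local expected=$start
  while read -r n; do
    # Everything between the last seen value and this one is missing.
    while (( expected < n )); do
      echo "missing: $expected"
      (( expected++ ))
    done
    (( expected == n )) && (( expected++ ))
  done
  # Anything past the last value read is missing too.
  while (( expected <= end )); do
    echo "missing: $expected"
    (( expected++ ))
  done
}

# Example: message 3 never made it through.
printf '1\n2\n4\n5\n' | sequence_check 1 5   # → missing: 3
```

Any line of `missing:` output means a message was produced but never consumed.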
When I kill one of the brokers during message production, I lose between 10 and 60 log messages (depending, I think, on the producer type, async vs. sync) in the final consumed output. I'm fairly sure this is from the producer firing off messages to the downed broker before ZooKeeper can notify the producer of the pool reconfiguration.
So this is more or less as expected. In my most recent test, I lost 56 messages when I killed the broker, all of them generated during the same second. Including those 56, about 500 messages were fired off during that second. (I don't have millisecond times on my messages right now; maybe I should add that in.) As expected, no messages were lost when I brought the downed broker back online: ZooKeeper reconfigured and the producer rerouted with no problems there.
Summary:
In a 2-node broker pool with one producer, it takes Kafka/ZooKeeper less than a second to notice a failed broker and reroute messages.
Is this acceptable? We will discuss :)
-Ao
I don't have millisecond times on my messages right now (maybe I should add that in).
Update: I've done just that.
consumed: 2012-08-07_18:33:47.128786953 2393
consumed: 2012-08-07_18:33:47.724846361 2733
In a single async producer, two-broker setup, when taking down one broker, I lose half of the messages sent for about 600 ms. If we had 10 brokers and one went down, I would expect to lose 1/10th of the messages for about 600 ms. It might take longer for producers to reconfigure with more brokers and producers, though.
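The ~600 ms figure comes straight from the two timestamps above: the last message consumed before the failure window and the first one after it. Since both fall within the same minute, comparing the seconds fields is enough:

```shell
# Compute the failover window from the two consumed lines above.
# Both timestamps fall within the same minute, so after splitting on
# '_', ':', and ' ', fields 4 and 8 are the (fractional) seconds.
last_before="2012-08-07_18:33:47.128786953"
first_after="2012-08-07_18:33:47.724846361"
echo "$last_before $first_after" | awk -F'[:_ ]' \
  '{ printf "gap: %.0f ms\n", ($8 - $4) * 1000 }'   # → gap: 596 ms
```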
Next steps:
- Test failover with single producer and 10 brokers.
- Test failover with 10 producers and 10 brokers.
On Aug 7, 2012, at 12:21 PM, Andrew Otto otto@wikimedia.org wrote:
Update!
- Test failover with single producer and 10 brokers.
consumed: 2012-08-07_18:33:50.846859719 4316
consumed: 2012-08-07_18:33:51.629561192 4709
In a single-producer, 10-broker setup, I lost 42 out of 393 messages, about 10%, as expected. It took ~800 ms for ZooKeeper to notify the producer and for the producer to reconfigure.
- Test failover with 10 producers and 10 brokers.
I've put the results here: https://gist.github.com/3286692#file_results%20for%2010%20producers%2C%2010%...
You can read the details if you like. In summary, with 10 brokers and 10 producers, when one of the brokers failed, each producer still lost only about 10% of its messages, for an average of 0.75 seconds.
ZooKeeper reconfiguration time seems constant, independent of the number of producers and brokers. I can't be sure that this is true, since I'm only testing with 10 producers. If we use this in production, we'll have hundreds of producers.
Since sent messages are (by default) load balanced across all brokers, the more brokers we have, the fewer messages we lose if one goes down. The scaling here looks pretty linear: lose x% of brokers, and you lose x% of messages for the time it takes ZooKeeper to reconfigure the producers.
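A back-of-the-envelope version of that linear scaling (the rate and window below are illustrative round numbers, not measured values from these tests):

```shell
# Expected loss when one broker dies, assuming messages are load balanced
# uniformly across brokers. Rate and window here are illustrative.
msgs_per_sec=500   # producer send rate
window_ms=800      # time for ZK to notify producers and producers to reconfigure
brokers=10         # pool size; one broker down => 1/brokers of traffic lost
lost=$(( msgs_per_sec * window_ms / 1000 / brokers ))
echo "expected loss: ~$lost messages"   # → expected loss: ~40 messages
```

That lines up roughly with the 42-of-393 observed in the 10-broker test.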
On Aug 7, 2012, at 2:45 PM, Andrew Otto otto@wikimedia.org wrote:
More updates! David and I discussed experimenting with failover when ZooKeeper clusters of various sizes have problems as well.
- Single ZK, single producer, 10 brokers. Failure of ZK during production. All is well! No messages were lost, as expected.
- Single ZK, single producer, 10 brokers, single concurrent consumer. Failure of ZK during production/consumption. Consumers (by default) use ZK to keep track of what they have already consumed. If the ZK pool is down, consumers stop consuming altogether. In this case, when the ZK died, consumption stopped. Production continued, and since all of the brokers that the producer knew about before ZK died were still available, all of the messages made it to the brokers. I started ZK back up and restarted consumption. All messages were consumed, so no loss. This is as expected.
- Single ZK, single producer, 3 brokers. Failure of ZK and then 1 broker during production. During message production, I stopped the single ZK, and then a single broker. Since ZK wasn't running, the producer had no way of being notified about the broker failure; it just kept on sending messages to the downed broker. I started the ZK and the downed broker back up, and then consumed messages. I lost 33% of the messages sent during the time that the broker was down, which is as expected (1 of 3 brokers was down).
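For reference, the consumer's dependence on ZK comes from its config. The consumer.properties I'm passing with --props is along these lines (property names as in the 0.7-era Kafka docs; the values are placeholders, not my actual settings):

```
# consumer.properties (illustrative; values are placeholders)

# ZK ensemble where brokers are discovered and consumed offsets are stored:
zk.connect=analytics1003:2181
# Consumer group under which offsets are tracked in ZK:
groupid=test-consumer-group
# How often consumed offsets are committed back to ZK:
autocommit.interval.ms=1000
```

With offsets living in ZK, a dead ZK means the consumer can neither discover brokers nor commit progress, which is why consumption halts entirely.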
So far, everything has been going exactly as expected. Next up are tests with multiple ZooKeepers. I'm hooooping for decreased reconfiguration times when a broker fails. We will see!
On Aug 7, 2012, at 5:09 PM, Andrew Otto otto@wikimedia.org wrote: