I'm working on testing Kafka broker failover, to see if and how many messages are lost in the case that a broker dies while producers are sending messages.  Here's what I'm doing to test.  It's nothing rigorous, just a try it and see what happens.  Oh and here's a gist with the commands and scripts I'm using:  https://gist.github.com/3286692

1)  Start up a broker on each an03 and an04:
$ bin/kafka-server-start.sh config/server.properties

2)  Start producing on an03:
./sequence_generate.sh 20000 10000 | bin/kafka-console-producer.sh --zookeeper analytics1003:2181 --topic test6

3)  While logs are being produced, kill a Kafka broker.

4)  Produce some more, and at some point while still producing, re-start the downed broker:
$ bin/kafka-server-start.sh config/server.properties

5) Finish producing.  Kill producer to make sure it doesn't have anything batched.  (Should only matter with an asynchronous producer).

6) Start consumer on an04, saving stdout:
bin/kafka-consumer-shell.sh --props config/consumer.properties --topic test6 | tee -a /tmp/test6.log

7) Check logs to makes sure all messages made it through:
cat /tmp/test6.log | egrep '^consumed:' | awk '{print $3}' | sort -g | ~/bin/sequence_check.sh 1 20000



When I kill one of the brokers during message producing, I lose between 10-60 log messages (I think depending on the producer type, async vs. sync) in the final consumed output.  I'm sure this is from the producer firing off messages to the assigned downed broker before ZooKeeper can notify the producer of the the pool reconfiguration.  


So, I guess this is kind of as expected.  In my most recent test, I lost 56 messages when I killed the broker.  All of those messages were generated during the same second.  Including those 56, there were about 500 messages fired off during that second.  I don't have millisecond times on my messages right now (maybe I should add that in).  As would be expected, no messages were lost when I brought the second broker back online.  ZooKeeper reconfigured and the producer rerouted with no problems there.

Summary: 

In a 2 node broker pool with one producer, it takes Kafka/Zookeeper less than a second to notice a failed broker and to reroute messages.  

Is this acceptable?  We will discuss :)

-Ao